[ClusterLabs] Antw: [EXT] Re: node fencing due to "Stonith/shutdown of node .. was not expected" while the node shut down cleanly

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Tue Feb 9 02:29:08 EST 2021


>>> Ken Gaillot <kgaillot at redhat.com> wrote on 2021-02-08 at 17:43 in message
<5ee981d3893dd7712c747661de05240df1ccd8eb.camel at redhat.com>:
> On Mon, 2021‑02‑08 at 08:41 +0100, Ulrich Windl wrote:
>> Hi!
>> 
>> There were previous indications of this problem, but today I had it
>> again:
>> I restarted a node (h18, DC) via "crm cluster restart", and the node
>> shutdown cleanly (at least it came to an end), but when restarting,
>> the node was fenced by the new DC (h16):
>> 
>> Feb 08 08:12:24 h18 pacemaker‑controld[8293]:  notice: Transition 409
>> (Complete=23, Pending=0, Fired=0, Skipped=0, Incomplete=0,
>> Source=/var/lib/pacemaker/pengine/pe‑input‑4.bz2): Complete
>> Feb 08 08:12:24 h18 pacemaker‑controld[8293]:  notice: Disconnected
>> from the executor
>> Feb 08 08:12:24 h18 pacemaker‑controld[8293]:  notice: Disconnected
>> from Corosync
>> Feb 08 08:12:24 h18 pacemaker‑controld[8293]:  notice: Disconnected
>> from the CIB manager
>> Feb 08 08:12:24 h18 pacemakerd[8286]:  notice: Stopping pacemaker‑
>> schedulerd
>> Feb 08 08:12:24 h18 pacemaker‑schedulerd[8292]:  notice: Caught
>> 'Terminated' signal
>> Feb 08 08:12:24 h18 pacemakerd[8286]:  notice: Stopping pacemaker‑
>> attrd
>> Feb 08 08:12:24 h18 pacemaker‑attrd[8291]:  notice: Caught
>> 'Terminated' signal
>> Feb 08 08:12:24 h18 pacemaker‑execd[8290]:  notice: Caught
>> 'Terminated' signal
>> Feb 08 08:12:24 h18 pacemakerd[8286]:  notice: Stopping pacemaker‑
>> execd
>> Feb 08 08:12:24 h18 pacemaker‑based[8287]:  warning:
>> new_event_notification (8287‑8291‑11): Broken pipe (32)
>> Feb 08 08:12:24 h18 pacemaker‑based[8287]:  warning: Notification of
>> client attrd/b836bbe9‑648e‑4b1a‑ae78‑b351d5ffa00b failed
>> Feb 08 08:12:24 h18 pacemakerd[8286]:  notice: Stopping pacemaker‑
>> fenced
>> Feb 08 08:12:24 h18 pacemaker‑fenced[8288]:  notice: Caught
>> 'Terminated' signal
>> Feb 08 08:12:24 h18 pacemakerd[8286]:  notice: Stopping pacemaker‑
>> based
>> Feb 08 08:12:24 h18 pacemaker‑based[8287]:  notice: Caught
>> 'Terminated' signal
>> Feb 08 08:12:24 h18 pacemaker‑based[8287]:  notice: Disconnected from
>> Corosync
>> Feb 08 08:12:24 h18 pacemaker‑based[8287]:  notice: Disconnected from
>> Corosync
>> Feb 08 08:12:24 h18 sbd[8264]:  warning: cleanup_servant_by_pid:
>> Servant for pcmk (pid: 8270) has terminated
>> Feb 08 08:12:24 h18 pacemakerd[8286]:  notice: Shutdown complete
>> Feb 08 08:12:24 h18 systemd[1]: Stopped Pacemaker High Availability
>> Cluster Manager.
>> Feb 08 08:12:24 h18 sbd[33927]:       pcmk:   notice: servant_pcmk:
>> Monitoring Pacemaker health
>> Feb 08 08:12:24 h18 systemd[1]: Stopping Shared‑storage based fencing
>> daemon...
>> Feb 08 08:12:24 h18 systemd[1]: Stopping Corosync Cluster Engine...
>> Feb 08 08:12:24 h18 sbd[8264]:  warning: cleanup_servant_by_pid:
>> Servant for pcmk (pid: 33927) has terminated
>> Feb 08 08:12:24 h18 sbd[8264]:  warning: cleanup_servant_by_pid:
>> Servant for cluster (pid: 8271) has terminated
>> Feb 08 08:12:24 h18 corosync[33929]: Signaling Corosync Cluster
>> Engine (corosync) to terminate: [  OK  ]
>> Feb 08 08:12:24 h18 corosync[8279]:   [MAIN  ] Node was shut down by
>> a signal
>> Feb 08 08:12:24 h18 corosync[8279]:   [SERV  ] Unloading all
>> Corosync service engines.
>> Feb 08 08:12:24 h18 corosync[8279]:   [QB    ] withdrawing server
>> sockets
>> Feb 08 08:12:24 h18 corosync[8279]:   [SERV  ] Service engine
>> unloaded: corosync vote quorum service v1.0
>> Feb 08 08:12:24 h18 corosync[8279]:   [QB    ] withdrawing server
>> sockets
>> Feb 08 08:12:24 h18 corosync[8279]:   [SERV  ] Service engine
>> unloaded: corosync configuration map access
>> Feb 08 08:12:24 h18 corosync[8279]:   [QB    ] withdrawing server
>> sockets
>> Feb 08 08:12:24 h18 corosync[8279]:   [SERV  ] Service engine
>> unloaded: corosync configuration service
>> Feb 08 08:12:24 h18 corosync[8279]:   [QB    ] withdrawing server
>> sockets
>> Feb 08 08:12:24 h18 corosync[8279]:   [SERV  ] Service engine
>> unloaded: corosync cluster closed process group service v1.01
>> Feb 08 08:12:24 h18 corosync[8279]:   [QB    ] withdrawing server
>> sockets
>> Feb 08 08:12:24 h18 corosync[8279]:   [SERV  ] Service engine
>> unloaded: corosync cluster quorum service v0.1
>> Feb 08 08:12:24 h18 corosync[8279]:   [SERV  ] Service engine
>> unloaded: corosync profile loading service
>> Feb 08 08:12:24 h18 corosync[8279]:   [MAIN  ] Corosync Cluster
>> Engine exiting normally
>> Feb 08 08:12:24 h18 corosync[33929]: Waiting for corosync services to
>> unload:[  OK  ]
>> Feb 08 08:12:24 h18 systemd[1]: Stopped Corosync Cluster Engine.
>> Feb 08 08:12:24 h18 sbd[8264]:  warning: cleanup_servant_by_pid:
>> Servant for /dev/disk/by‑id/dm‑name‑SBD_1‑3P1 (pid: 8268) has
>> terminated
>> Feb 08 08:12:24 h18 sbd[8264]:  warning: cleanup_servant_by_pid:
>> Servant for /dev/disk/by‑id/dm‑name‑SBD_1‑3P2 (pid: 8269) has
>> terminated
>> Feb 08 08:12:24 h18 systemd[1]: Stopped Shared‑storage based fencing
>> daemon.
>> 
>> So I hope we can agree that that was a "clean" shutdown of the node.
>> 
>> Now the new DC:
>> Feb 08 08:12:24 h16 pacemaker‑controld[7525]:  notice: Our peer on
>> the DC (h18) is dead
>> Feb 08 08:12:24 h16 pacemaker‑controld[7525]:  notice: State
>> transition S_NOT_DC ‑> S_ELECTION
>> Feb 08 08:12:24 h16 pacemaker‑controld[7525]:  notice: State
>> transition S_ELECTION ‑> S_INTEGRATION
>> Feb 08 08:12:24 h16 pacemaker‑attrd[7523]:  notice: Node h18 state is
>> now lost
>> Feb 08 08:12:24 h16 pacemaker‑attrd[7523]:  notice: Removing all h18
>> attributes for peer loss
>> Feb 08 08:12:24 h16 pacemaker‑attrd[7523]:  notice: Purged 1 peer
>> with id=118 and/or uname=h18 from the membership cache
>> Feb 08 08:12:24 h16 pacemaker‑fenced[7521]:  notice: Node h18 state
>> is now lost
>> Feb 08 08:12:24 h16 pacemaker‑fenced[7521]:  notice: Purged 1 peer
>> with id=118 and/or uname=h18 from the membership cache
>> Feb 08 08:12:24 h16 pacemaker‑based[7520]:  notice: Node h18 state is
>> now lost
>> Feb 08 08:12:24 h16 pacemaker‑based[7520]:  notice: Purged 1 peer
>> with id=118 and/or uname=h18 from the membership cache
>> Feb 08 08:12:24 h16 corosync[6567]:   [TOTEM ] A new membership
>> (172.20.16.16:42448) was formed. Members left: 118
>> Feb 08 08:12:24 h16 corosync[6567]:   [CPG   ] downlist left_list: 1
>> received
>> Feb 08 08:12:24 h16 corosync[6567]:   [CPG   ] downlist left_list: 1
>> received
>> Feb 08 08:12:24 h16 corosync[6567]:   [QUORUM] Members[2]: 116 119
>> Feb 08 08:12:24 h16 corosync[6567]:   [MAIN  ] Completed service
>> synchronization, ready to provide service.
>> Feb 08 08:12:24 h16 pacemakerd[7518]:  notice: Node h18 state is now
>> lost
>> Feb 08 08:12:24 h16 pacemaker‑controld[7525]:  notice: Node h18 state
>> is now lost
>> Feb 08 08:12:24 h16 pacemaker‑controld[7525]:  warning:
>> Stonith/shutdown of node h18 was not expected
>> Feb 08 08:12:24 h16 kernel: dlm: closing connection to node 118
>> Feb 08 08:12:25 h16 pacemaker‑controld[7525]:  notice: Updating
>> quorum status to true (call=160)
>> Feb 08 08:12:25 h16 pacemaker‑schedulerd[7524]:  notice: Watchdog
>> will be used via SBD if fencing is required and stonith‑watchdog‑
>> timeout is nonzero
>> 
>> Why isn't the shutdown expected? h18 had "announced" its shutdown:
> 
> This isn't a problem ‑‑ h16 didn't expect the shutdown because it
> wasn't DC, so it had no information from the scheduler to base any
> expectations on. Once it runs the scheduler, it should have the correct
> information.
> 
>> Feb 08 08:11:52 h18 pacemakerd[8286]:  notice: Caught 'Terminated'
>> signal
>> Feb 08 08:11:52 h18 pacemakerd[8286]:  notice: Shutting down
>> Pacemaker
>> Feb 08 08:11:52 h18 pacemakerd[8286]:  notice: Stopping pacemaker‑
>> controld
>> Feb 08 08:11:52 h18 pacemaker‑controld[8293]:  notice: Caught
>> 'Terminated' signal
>> Feb 08 08:11:52 h18 pacemaker‑controld[8293]:  notice: Initiating
>> controller shutdown sequence
>> Feb 08 08:11:52 h18 pacemaker‑controld[8293]:  notice: State
>> transition S_IDLE ‑> S_POLICY_ENGINE
>> Feb 08 08:11:52 h18 pacemaker‑attrd[8291]:  notice: Setting
>> shutdown[h18]: (unset) ‑> 1612768312
>> 
>> So when h18 came up again, it was fenced:
>> Feb 08 08:12:25 h16 pacemaker‑schedulerd[7524]:  warning: Scheduling
>> Node h18 for STONITH
>> Feb 08 08:12:25 h16 pacemaker‑schedulerd[7524]:  notice:  * Fence
>> (reboot) h18 'resource actions are unrunnable'
> 
> This looks unrelated to me. The question is why it thinks actions are
> unrunnable.

Hi Ken!

(Once again thanks for looking at it.)
Because the node had shut down cleanly? I mean: if all the resources had
stopped cleanly and so had the node, there might be some constraints
preventing resources from starting, but why fence the node?
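
Just a thought on how to answer the "unrunnable" question: replaying the
saved scheduler input on the new DC should show the reasoning. A rough
sketch, assuming the pe-warn file is still present on h16:

    # crm_simulate --simulate --show-scores \
        --xml-file /var/lib/pacemaker/pengine/pe-warn-36.bz2

That ought to print the fencing decision together with the actions the
scheduler considered unrunnable.
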
I suspect that the "node shutdown" information either is not passed on to
the new DC, or is deleted by mistake. Either way, it doesn't look right.
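
One way to check that suspicion would be to see whether the shutdown
attribute h18 announced (see the attrd line quoted above) is actually
visible on the surviving nodes right after h18 stops, e.g.:

    # attrd_updater --query --name shutdown --node h18
    # cibadmin -Q -o status | grep -B2 -A2 'name="shutdown"'

If the attribute is missing or already cleared on the new DC, that would
support the suspicion above.
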

BTW I could reproduce the problem with a different DC.
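
When reproducing it, the fencer's history on the surviving DC might also be
worth capturing, e.g.:

    # stonith_admin --history '*' --verbose

which should show which node requested the reboot of h18 and when.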

Regards,
Ulrich

> 
>> Feb 08 08:12:25 h16 pacemaker‑schedulerd[7524]:  warning: Calculated
>> transition 0 (with warnings), saving inputs in
>> /var/lib/pacemaker/pengine/pe‑warn‑36.bz2
>> Feb 08 08:12:25 h16 pacemaker‑controld[7525]:  notice: Processing
>> graph 0 (ref=pe_calc‑dc‑1612768345‑16) derived from
>> /var/lib/pacemaker/pengine/pe‑warn‑36.bz2
>> Feb 08 08:12:25 h16 pacemaker‑controld[7525]:  notice: Requesting
>> fencing (reboot) of node h18
>> Feb 08 08:12:25 h16 pacemaker‑fenced[7521]:  notice: Client
>> pacemaker‑controld.7525.4c7f7600 wants to fence (reboot) 'h18' with
>> device '(any)'
>> Feb 08 08:12:25 h16 pacemaker‑fenced[7521]:  notice: Requesting peer
>> fencing (reboot) targeting h18
>> Feb 08 08:12:30 h16 corosync[6567]:   [TOTEM ] A new membership
>> (172.20.16.16:42456) was formed. Members joined: 118
>> Feb 08 08:12:30 h16 corosync[6567]:   [CPG   ] downlist left_list: 0
>> received
>> Feb 08 08:12:30 h16 corosync[6567]:   [CPG   ] downlist left_list: 0
>> received
>> Feb 08 08:12:30 h16 corosync[6567]:   [CPG   ] downlist left_list: 0
>> received
>> Feb 08 08:12:30 h16 corosync[6567]:   [QUORUM] Members[3]: 116 118
>> 119
>> Feb 08 08:12:30 h16 corosync[6567]:   [MAIN  ] Completed service
>> synchronization, ready to provide service.
>> Feb 08 08:12:30 h16 pacemaker‑controld[7525]:  notice: Node h18 state
>> is now member
>> Feb 08 08:12:30 h16 pacemakerd[7518]:  notice: Node h18 state is now
>> member
>> Feb 08 08:12:30 h16 pacemaker‑fenced[7521]:  notice: Requesting that
>> h16 perform 'reboot' action targeting h18
>> Feb 08 08:12:30 h16 pacemaker‑fenced[7521]:  notice: prm_stonith_sbd
>> is eligible to fence (reboot) h18: dynamic‑list
>> Feb 08 08:12:37 h16 corosync[6567]:   [TOTEM ] A processor failed,
>> forming new configuration.
>> Feb 08 08:12:44 h16 corosync[6567]:   [TOTEM ] A new membership
>> (172.20.16.16:42460) was formed. Members left: 118
>> Feb 08 08:12:44 h16 corosync[6567]:   [TOTEM ] Failed to receive the
>> leave message. failed: 118
>> Feb 08 08:12:44 h16 corosync[6567]:   [CPG   ] downlist left_list: 1
>> received
>> Feb 08 08:12:44 h16 corosync[6567]:   [CPG   ] downlist left_list: 1
>> received
>> Feb 08 08:12:44 h16 corosync[6567]:   [QUORUM] Members[2]: 116 119
>> Feb 08 08:12:44 h16 corosync[6567]:   [MAIN  ] Completed service
>> synchronization, ready to provide service.
>> Feb 08 08:12:44 h16 pacemaker‑controld[7525]:  notice: Node h18 state
>> is now lost
>> Feb 08 08:12:44 h16 pacemakerd[7518]:  notice: Node h18 state is now
>> lost
>> 
>> Feb 08 08:12:24 h18 systemd[1]: Starting Shared‑storage based fencing
>> daemon...
>> Feb 08 08:12:24 h18 systemd[1]: Starting Corosync Cluster Engine...
>> Feb 08 08:12:24 h18 sbd[33948]:   notice: main: Do flush + write 'b'
>> to sysrq in case of timeout
>> Feb 08 08:12:25 h18 sbd[33958]: /dev/disk/by‑id/dm‑name‑SBD_1‑
>> 3P1:   notice: servant_md: Monitoring slot 4 on disk /dev/disk/by‑
>> id/dm‑name‑SBD_1‑3P1
>> Feb 08 08:12:25 h18 sbd[33959]: /dev/disk/by‑id/dm‑name‑SBD_1‑
>> 3P2:   notice: servant_md: Monitoring slot 4 on disk /dev/disk/by‑
>> id/dm‑name‑SBD_1‑3P2
>> Feb 08 08:12:25 h18 sbd[33960]:       pcmk:   notice: servant_pcmk:
>> Monitoring Pacemaker health
>> Feb 08 08:12:25 h18 sbd[33962]:    cluster:   notice:
>> servant_cluster: Monitoring unknown cluster health
>> Feb 08 08:12:25 h18 corosync[33964]:   [MAIN  ] Corosync Cluster
>> Engine ('2.4.5'): started and ready to provide service.
>> Feb 08 08:12:25 h18 corosync[33964]:   [MAIN  ] Corosync built‑in
>> features: testagents systemd qdevices qnetd pie relro bindnow
>> Feb 08 08:12:25 h18 corosync[33967]:   [TOTEM ] Initializing
>> transport (UDP/IP Unicast).
>> Feb 08 08:12:25 h18 corosync[33967]:   [TOTEM ] Initializing
>> transmit/receive security (NSS) crypto: aes256 hash: sha1
>> Feb 08 08:12:25 h18 corosync[33967]:   [TOTEM ] Initializing
>> transport (UDP/IP Unicast).
>> Feb 08 08:12:25 h18 corosync[33967]:   [TOTEM ] Initializing
>> transmit/receive security (NSS) crypto: aes256 hash: sha1
>> Feb 08 08:12:25 h18 corosync[33967]:   [TOTEM ] The network interface
>> [172.20.16.18] is now up.
>> Feb 08 08:12:25 h18 corosync[33967]:   [SERV  ] Service engine
>> loaded: corosync configuration map access [0]
>> Feb 08 08:12:25 h18 corosync[33967]:   [QB    ] server name: cmap
>> Feb 08 08:12:25 h18 corosync[33967]:   [SERV  ] Service engine
>> loaded: corosync configuration service [1]
>> Feb 08 08:12:25 h18 corosync[33967]:   [QB    ] server name: cfg
>> Feb 08 08:12:25 h18 corosync[33967]:   [SERV  ] Service engine
>> loaded: corosync cluster closed process group service v1.01 [2]
>> Feb 08 08:12:25 h18 corosync[33967]:   [QB    ] server name: cpg
>> Feb 08 08:12:25 h18 corosync[33967]:   [SERV  ] Service engine
>> loaded: corosync profile loading service [4]
>> Feb 08 08:12:25 h18 corosync[33967]:   [QUORUM] Using quorum provider
>> corosync_votequorum
>> Feb 08 08:12:25 h18 corosync[33967]:   [SERV  ] Service engine
>> loaded: corosync vote quorum service v1.0 [5]
>> Feb 08 08:12:25 h18 corosync[33967]:   [QB    ] server name:
>> votequorum
>> Feb 08 08:12:25 h18 corosync[33967]:   [SERV  ] Service engine
>> loaded: corosync cluster quorum service v0.1 [3]
>> Feb 08 08:12:25 h18 corosync[33967]:   [QB    ] server name: quorum
>> Feb 08 08:12:25 h18 corosync[33967]:   [TOTEM ] adding new UDPU
>> member {172.20.16.16}
>> Feb 08 08:12:25 h18 corosync[33967]:   [TOTEM ] adding new UDPU
>> member {172.20.16.18}
>> Feb 08 08:12:25 h18 corosync[33967]:   [TOTEM ] adding new UDPU
>> member {172.20.16.19}
>> Feb 08 08:12:25 h18 corosync[33967]:   [TOTEM ] The network interface
>> [10.2.2.18] is now up.
>> Feb 08 08:12:25 h18 corosync[33967]:   [TOTEM ] adding new UDPU
>> member {10.2.2.16}
>> Feb 08 08:12:25 h18 corosync[33967]:   [TOTEM ] adding new UDPU
>> member {10.2.2.18}
>> Feb 08 08:12:25 h18 corosync[33967]:   [TOTEM ] adding new UDPU
>> member {10.2.2.19}
>> Feb 08 08:12:25 h18 corosync[33967]:   [TOTEM ] A new membership
>> (172.20.16.18:42448) was formed. Members joined: 118
>> Feb 08 08:12:25 h18 corosync[33967]:   [CPG   ] downlist left_list: 0
>> received
>> Feb 08 08:12:25 h18 corosync[33967]:   [QUORUM] Members[1]: 118
>> Feb 08 08:12:25 h18 corosync[33967]:   [MAIN  ] Completed service
>> synchronization, ready to provide service.
>> Feb 08 08:12:25 h18 corosync[33950]: Starting Corosync Cluster Engine
>> (corosync): [  OK  ]
>> Feb 08 08:12:25 h18 systemd[1]: Started Corosync Cluster Engine.
>> Feb 08 08:12:30 h18 corosync[33967]:   [TOTEM ] A new membership
>> (172.20.16.16:42456) was formed. Members joined: 116 119
>> Feb 08 08:12:30 h18 corosync[33967]:   [CPG   ] downlist left_list: 0
>> received
>> Feb 08 08:12:30 h18 corosync[33967]:   [CPG   ] downlist left_list: 0
>> received
>> Feb 08 08:12:30 h18 corosync[33967]:   [CPG   ] downlist left_list: 0
>> received
>> Feb 08 08:12:30 h18 corosync[33967]:   [QUORUM] This node is within
>> the primary component and will provide service.
>> Feb 08 08:12:30 h18 corosync[33967]:   [QUORUM] Members[3]: 116 118
>> 119
>> Feb 08 08:12:30 h18 corosync[33967]:   [MAIN  ] Completed service
>> synchronization, ready to provide service.
>> Feb 08 08:12:30 h18 sbd[33953]:   notice: inquisitor_child: Servant
>> cluster is healthy (age: 0)
>> 
>> Broadcast message from systemd-journald@h18 (Mon 2021-02-08 08:12:32
>> CET):
>> 
>> sbd[33953]:    emerg: do_exit: Rebooting system: reboot
>> 
>> Feb 08 08:12:32 h18 sbd[33958]: /dev/disk/by‑id/dm‑name‑SBD_1‑
>> 3P1:   notice: servant_md: Received command reset from h16 on disk
>> /dev/disk/by‑id/dm‑name‑SBD_1‑3P1
>> Feb 08 08:12:32 h18 sbd[33959]: /dev/disk/by‑id/dm‑name‑SBD_1‑
>> 3P2:   notice: servant_md: Received command reset from h16 on disk
>> /dev/disk/by‑id/dm‑name‑SBD_1‑3P2
>> Feb 08 08:12:32 h18 sbd[33953]:  warning: inquisitor_child:
>> /dev/disk/by‑id/dm‑name‑SBD_1‑3P1 requested a reset
>> Feb 08 08:12:32 h18 sbd[33953]:    emerg: do_exit: Rebooting system:
>> reboot
>> 
>> OS/software:
>> SLES15 SP2
>> ph16:~ # rpm ‑q pacemaker corosync
>> pacemaker‑2.0.4+20200616.2deceaa3a‑3.3.1.x86_64
>> corosync‑2.4.5‑6.3.2.x86_64
>> 
>> Regards,
>> Ulrich
>> 
>> 
> ‑‑ 
> Ken Gaillot <kgaillot at redhat.com>
> 




