[ClusterLabs] node fencing due to "Stonith/shutdown of node .. was not expected" while the node shut down cleanly
Ken Gaillot
kgaillot at redhat.com
Mon Feb 8 11:43:17 EST 2021
On Mon, 2021-02-08 at 08:41 +0100, Ulrich Windl wrote:
> Hi!
>
> There were previous indications of this problem, but today I had it
> again:
> I restarted a node (h18, DC) via "crm cluster restart", and the node
> shutdown cleanly (at least it came to an end), but when restarting,
> the node was fenced by the new DC (h16):
>
> Feb 08 08:12:24 h18 pacemaker-controld[8293]: notice: Transition 409
> (Complete=23, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-4.bz2): Complete
> Feb 08 08:12:24 h18 pacemaker-controld[8293]: notice: Disconnected
> from the executor
> Feb 08 08:12:24 h18 pacemaker-controld[8293]: notice: Disconnected
> from Corosync
> Feb 08 08:12:24 h18 pacemaker-controld[8293]: notice: Disconnected
> from the CIB manager
> Feb 08 08:12:24 h18 pacemakerd[8286]: notice: Stopping pacemaker-
> schedulerd
> Feb 08 08:12:24 h18 pacemaker-schedulerd[8292]: notice: Caught
> 'Terminated' signal
> Feb 08 08:12:24 h18 pacemakerd[8286]: notice: Stopping pacemaker-
> attrd
> Feb 08 08:12:24 h18 pacemaker-attrd[8291]: notice: Caught
> 'Terminated' signal
> Feb 08 08:12:24 h18 pacemaker-execd[8290]: notice: Caught
> 'Terminated' signal
> Feb 08 08:12:24 h18 pacemakerd[8286]: notice: Stopping pacemaker-
> execd
> Feb 08 08:12:24 h18 pacemaker-based[8287]: warning:
> new_event_notification (8287-8291-11): Broken pipe (32)
> Feb 08 08:12:24 h18 pacemaker-based[8287]: warning: Notification of
> client attrd/b836bbe9-648e-4b1a-ae78-b351d5ffa00b failed
> Feb 08 08:12:24 h18 pacemakerd[8286]: notice: Stopping pacemaker-
> fenced
> Feb 08 08:12:24 h18 pacemaker-fenced[8288]: notice: Caught
> 'Terminated' signal
> Feb 08 08:12:24 h18 pacemakerd[8286]: notice: Stopping pacemaker-
> based
> Feb 08 08:12:24 h18 pacemaker-based[8287]: notice: Caught
> 'Terminated' signal
> Feb 08 08:12:24 h18 pacemaker-based[8287]: notice: Disconnected from
> Corosync
> Feb 08 08:12:24 h18 pacemaker-based[8287]: notice: Disconnected from
> Corosync
> Feb 08 08:12:24 h18 sbd[8264]: warning: cleanup_servant_by_pid:
> Servant for pcmk (pid: 8270) has terminated
> Feb 08 08:12:24 h18 pacemakerd[8286]: notice: Shutdown complete
> Feb 08 08:12:24 h18 systemd[1]: Stopped Pacemaker High Availability
> Cluster Manager.
> Feb 08 08:12:24 h18 sbd[33927]: pcmk: notice: servant_pcmk:
> Monitoring Pacemaker health
> Feb 08 08:12:24 h18 systemd[1]: Stopping Shared-storage based fencing
> daemon...
> Feb 08 08:12:24 h18 systemd[1]: Stopping Corosync Cluster Engine...
> Feb 08 08:12:24 h18 sbd[8264]: warning: cleanup_servant_by_pid:
> Servant for pcmk (pid: 33927) has terminated
> Feb 08 08:12:24 h18 sbd[8264]: warning: cleanup_servant_by_pid:
> Servant for cluster (pid: 8271) has terminated
> Feb 08 08:12:24 h18 corosync[33929]: Signaling Corosync Cluster
> Engine (corosync) to terminate: [ OK ]
> Feb 08 08:12:24 h18 corosync[8279]: [MAIN ] Node was shut down by
> a signal
> Feb 08 08:12:24 h18 corosync[8279]: [SERV ] Unloading all
> Corosync service engines.
> Feb 08 08:12:24 h18 corosync[8279]: [QB ] withdrawing server
> sockets
> Feb 08 08:12:24 h18 corosync[8279]: [SERV ] Service engine
> unloaded: corosync vote quorum service v1.0
> Feb 08 08:12:24 h18 corosync[8279]: [QB ] withdrawing server
> sockets
> Feb 08 08:12:24 h18 corosync[8279]: [SERV ] Service engine
> unloaded: corosync configuration map access
> Feb 08 08:12:24 h18 corosync[8279]: [QB ] withdrawing server
> sockets
> Feb 08 08:12:24 h18 corosync[8279]: [SERV ] Service engine
> unloaded: corosync configuration service
> Feb 08 08:12:24 h18 corosync[8279]: [QB ] withdrawing server
> sockets
> Feb 08 08:12:24 h18 corosync[8279]: [SERV ] Service engine
> unloaded: corosync cluster closed process group service v1.01
> Feb 08 08:12:24 h18 corosync[8279]: [QB ] withdrawing server
> sockets
> Feb 08 08:12:24 h18 corosync[8279]: [SERV ] Service engine
> unloaded: corosync cluster quorum service v0.1
> Feb 08 08:12:24 h18 corosync[8279]: [SERV ] Service engine
> unloaded: corosync profile loading service
> Feb 08 08:12:24 h18 corosync[8279]: [MAIN ] Corosync Cluster
> Engine exiting normally
> Feb 08 08:12:24 h18 corosync[33929]: Waiting for corosync services to
> unload:[ OK ]
> Feb 08 08:12:24 h18 systemd[1]: Stopped Corosync Cluster Engine.
> Feb 08 08:12:24 h18 sbd[8264]: warning: cleanup_servant_by_pid:
> Servant for /dev/disk/by-id/dm-name-SBD_1-3P1 (pid: 8268) has
> terminated
> Feb 08 08:12:24 h18 sbd[8264]: warning: cleanup_servant_by_pid:
> Servant for /dev/disk/by-id/dm-name-SBD_1-3P2 (pid: 8269) has
> terminated
> Feb 08 08:12:24 h18 systemd[1]: Stopped Shared-storage based fencing
> daemon.
>
> So I hope we can agree that that was a "clean" shutdown of the node.
>
> Now the new DC:
> Feb 08 08:12:24 h16 pacemaker-controld[7525]: notice: Our peer on
> the DC (h18) is dead
> Feb 08 08:12:24 h16 pacemaker-controld[7525]: notice: State
> transition S_NOT_DC -> S_ELECTION
> Feb 08 08:12:24 h16 pacemaker-controld[7525]: notice: State
> transition S_ELECTION -> S_INTEGRATION
> Feb 08 08:12:24 h16 pacemaker-attrd[7523]: notice: Node h18 state is
> now lost
> Feb 08 08:12:24 h16 pacemaker-attrd[7523]: notice: Removing all h18
> attributes for peer loss
> Feb 08 08:12:24 h16 pacemaker-attrd[7523]: notice: Purged 1 peer
> with id=118 and/or uname=h18 from the membership cache
> Feb 08 08:12:24 h16 pacemaker-fenced[7521]: notice: Node h18 state
> is now lost
> Feb 08 08:12:24 h16 pacemaker-fenced[7521]: notice: Purged 1 peer
> with id=118 and/or uname=h18 from the membership cache
> Feb 08 08:12:24 h16 pacemaker-based[7520]: notice: Node h18 state is
> now lost
> Feb 08 08:12:24 h16 pacemaker-based[7520]: notice: Purged 1 peer
> with id=118 and/or uname=h18 from the membership cache
> Feb 08 08:12:24 h16 corosync[6567]: [TOTEM ] A new membership
> (172.20.16.16:42448) was formed. Members left: 118
> Feb 08 08:12:24 h16 corosync[6567]: [CPG ] downlist left_list: 1
> received
> Feb 08 08:12:24 h16 corosync[6567]: [CPG ] downlist left_list: 1
> received
> Feb 08 08:12:24 h16 corosync[6567]: [QUORUM] Members[2]: 116 119
> Feb 08 08:12:24 h16 corosync[6567]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Feb 08 08:12:24 h16 pacemakerd[7518]: notice: Node h18 state is now
> lost
> Feb 08 08:12:24 h16 pacemaker-controld[7525]: notice: Node h18 state
> is now lost
> Feb 08 08:12:24 h16 pacemaker-controld[7525]: warning:
> Stonith/shutdown of node h18 was not expected
> Feb 08 08:12:24 h16 kernel: dlm: closing connection to node 118
> Feb 08 08:12:25 h16 pacemaker-controld[7525]: notice: Updating
> quorum status to true (call=160)
> Feb 08 08:12:25 h16 pacemaker-schedulerd[7524]: notice: Watchdog
> will be used via SBD if fencing is required and stonith-watchdog-
> timeout is nonzero
>
> Why isn't the shutdown expected? h18 had "announced" its shutdown:
This isn't a problem -- h16 didn't expect the shutdown because it
wasn't DC, so it had no information from the scheduler to base any
expectations on. Once it runs the scheduler, it should have the correct
information.
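If you want to double-check what the other nodes could see, the
transient "shutdown" attribute that h18 set (visible in your log
below) lives in the CIB status section and can be queried from any
node. A rough example with the stock tools -- attribute name and node
taken from your log, exact output format may differ by version:

  crm_attribute -N h18 -n shutdown -l reboot -G

A non-empty value there is what tells the scheduler that the node is
shutting down cleanly.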
> Feb 08 08:11:52 h18 pacemakerd[8286]: notice: Caught 'Terminated'
> signal
> Feb 08 08:11:52 h18 pacemakerd[8286]: notice: Shutting down
> Pacemaker
> Feb 08 08:11:52 h18 pacemakerd[8286]: notice: Stopping pacemaker-
> controld
> Feb 08 08:11:52 h18 pacemaker-controld[8293]: notice: Caught
> 'Terminated' signal
> Feb 08 08:11:52 h18 pacemaker-controld[8293]: notice: Initiating
> controller shutdown sequence
> Feb 08 08:11:52 h18 pacemaker-controld[8293]: notice: State
> transition S_IDLE -> S_POLICY_ENGINE
> Feb 08 08:11:52 h18 pacemaker-attrd[8291]: notice: Setting
> shutdown[h18]: (unset) -> 1612768312
>
> So when h18 came up again, it was fenced:
> Feb 08 08:12:25 h16 pacemaker-schedulerd[7524]: warning: Scheduling
> Node h18 for STONITH
> Feb 08 08:12:25 h16 pacemaker-schedulerd[7524]: notice: * Fence
> (reboot) h18 'resource actions are unrunnable'
This looks unrelated to me. The question is why it thinks actions are
unrunnable.
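One way to dig into that is to replay the saved input mentioned just
below with crm_simulate on any node; that should show which actions
the scheduler considered unrunnable and why. A sketch (the common
flags, output and verbosity handling vary a bit by version):

  crm_simulate -S -x /var/lib/pacemaker/pengine/pe-warn-36.bz2 -VV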
> Feb 08 08:12:25 h16 pacemaker-schedulerd[7524]: warning: Calculated
> transition 0 (with warnings), saving inputs in
> /var/lib/pacemaker/pengine/pe-warn-36.bz2
> Feb 08 08:12:25 h16 pacemaker-controld[7525]: notice: Processing
> graph 0 (ref=pe_calc-dc-1612768345-16) derived from
> /var/lib/pacemaker/pengine/pe-warn-36.bz2
> Feb 08 08:12:25 h16 pacemaker-controld[7525]: notice: Requesting
> fencing (reboot) of node h18
> Feb 08 08:12:25 h16 pacemaker-fenced[7521]: notice: Client
> pacemaker-controld.7525.4c7f7600 wants to fence (reboot) 'h18' with
> device '(any)'
> Feb 08 08:12:25 h16 pacemaker-fenced[7521]: notice: Requesting peer
> fencing (reboot) targeting h18
> Feb 08 08:12:30 h16 corosync[6567]: [TOTEM ] A new membership
> (172.20.16.16:42456) was formed. Members joined: 118
> Feb 08 08:12:30 h16 corosync[6567]: [CPG ] downlist left_list: 0
> received
> Feb 08 08:12:30 h16 corosync[6567]: [CPG ] downlist left_list: 0
> received
> Feb 08 08:12:30 h16 corosync[6567]: [CPG ] downlist left_list: 0
> received
> Feb 08 08:12:30 h16 corosync[6567]: [QUORUM] Members[3]: 116 118
> 119
> Feb 08 08:12:30 h16 corosync[6567]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Feb 08 08:12:30 h16 pacemaker-controld[7525]: notice: Node h18 state
> is now member
> Feb 08 08:12:30 h16 pacemakerd[7518]: notice: Node h18 state is now
> member
> Feb 08 08:12:30 h16 pacemaker-fenced[7521]: notice: Requesting that
> h16 perform 'reboot' action targeting h18
> Feb 08 08:12:30 h16 pacemaker-fenced[7521]: notice: prm_stonith_sbd
> is eligible to fence (reboot) h18: dynamic-list
> Feb 08 08:12:37 h16 corosync[6567]: [TOTEM ] A processor failed,
> forming new configuration.
> Feb 08 08:12:44 h16 corosync[6567]: [TOTEM ] A new membership
> (172.20.16.16:42460) was formed. Members left: 118
> Feb 08 08:12:44 h16 corosync[6567]: [TOTEM ] Failed to receive the
> leave message. failed: 118
> Feb 08 08:12:44 h16 corosync[6567]: [CPG ] downlist left_list: 1
> received
> Feb 08 08:12:44 h16 corosync[6567]: [CPG ] downlist left_list: 1
> received
> Feb 08 08:12:44 h16 corosync[6567]: [QUORUM] Members[2]: 116 119
> Feb 08 08:12:44 h16 corosync[6567]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Feb 08 08:12:44 h16 pacemaker-controld[7525]: notice: Node h18 state
> is now lost
> Feb 08 08:12:44 h16 pacemakerd[7518]: notice: Node h18 state is now
> lost
>
> Feb 08 08:12:24 h18 systemd[1]: Starting Shared-storage based fencing
> daemon...
> Feb 08 08:12:24 h18 systemd[1]: Starting Corosync Cluster Engine...
> Feb 08 08:12:24 h18 sbd[33948]: notice: main: Do flush + write 'b'
> to sysrq in case of timeout
> Feb 08 08:12:25 h18 sbd[33958]: /dev/disk/by-id/dm-name-SBD_1-
> 3P1: notice: servant_md: Monitoring slot 4 on disk /dev/disk/by-
> id/dm-name-SBD_1-3P1
> Feb 08 08:12:25 h18 sbd[33959]: /dev/disk/by-id/dm-name-SBD_1-
> 3P2: notice: servant_md: Monitoring slot 4 on disk /dev/disk/by-
> id/dm-name-SBD_1-3P2
> Feb 08 08:12:25 h18 sbd[33960]: pcmk: notice: servant_pcmk:
> Monitoring Pacemaker health
> Feb 08 08:12:25 h18 sbd[33962]: cluster: notice:
> servant_cluster: Monitoring unknown cluster health
> Feb 08 08:12:25 h18 corosync[33964]: [MAIN ] Corosync Cluster
> Engine ('2.4.5'): started and ready to provide service.
> Feb 08 08:12:25 h18 corosync[33964]: [MAIN ] Corosync built-in
> features: testagents systemd qdevices qnetd pie relro bindnow
> Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] Initializing
> transport (UDP/IP Unicast).
> Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] Initializing
> transmit/receive security (NSS) crypto: aes256 hash: sha1
> Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] Initializing
> transport (UDP/IP Unicast).
> Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] Initializing
> transmit/receive security (NSS) crypto: aes256 hash: sha1
> Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] The network interface
> [172.20.16.18] is now up.
> Feb 08 08:12:25 h18 corosync[33967]: [SERV ] Service engine
> loaded: corosync configuration map access [0]
> Feb 08 08:12:25 h18 corosync[33967]: [QB ] server name: cmap
> Feb 08 08:12:25 h18 corosync[33967]: [SERV ] Service engine
> loaded: corosync configuration service [1]
> Feb 08 08:12:25 h18 corosync[33967]: [QB ] server name: cfg
> Feb 08 08:12:25 h18 corosync[33967]: [SERV ] Service engine
> loaded: corosync cluster closed process group service v1.01 [2]
> Feb 08 08:12:25 h18 corosync[33967]: [QB ] server name: cpg
> Feb 08 08:12:25 h18 corosync[33967]: [SERV ] Service engine
> loaded: corosync profile loading service [4]
> Feb 08 08:12:25 h18 corosync[33967]: [QUORUM] Using quorum provider
> corosync_votequorum
> Feb 08 08:12:25 h18 corosync[33967]: [SERV ] Service engine
> loaded: corosync vote quorum service v1.0 [5]
> Feb 08 08:12:25 h18 corosync[33967]: [QB ] server name:
> votequorum
> Feb 08 08:12:25 h18 corosync[33967]: [SERV ] Service engine
> loaded: corosync cluster quorum service v0.1 [3]
> Feb 08 08:12:25 h18 corosync[33967]: [QB ] server name: quorum
> Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] adding new UDPU
> member {172.20.16.16}
> Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] adding new UDPU
> member {172.20.16.18}
> Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] adding new UDPU
> member {172.20.16.19}
> Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] The network interface
> [10.2.2.18] is now up.
> Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] adding new UDPU
> member {10.2.2.16}
> Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] adding new UDPU
> member {10.2.2.18}
> Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] adding new UDPU
> member {10.2.2.19}
> Feb 08 08:12:25 h18 corosync[33967]: [TOTEM ] A new membership
> (172.20.16.18:42448) was formed. Members joined: 118
> Feb 08 08:12:25 h18 corosync[33967]: [CPG ] downlist left_list: 0
> received
> Feb 08 08:12:25 h18 corosync[33967]: [QUORUM] Members[1]: 118
> Feb 08 08:12:25 h18 corosync[33967]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Feb 08 08:12:25 h18 corosync[33950]: Starting Corosync Cluster Engine
> (corosync): [ OK ]
> Feb 08 08:12:25 h18 systemd[1]: Started Corosync Cluster Engine.
> Feb 08 08:12:30 h18 corosync[33967]: [TOTEM ] A new membership
> (172.20.16.16:42456) was formed. Members joined: 116 119
> Feb 08 08:12:30 h18 corosync[33967]: [CPG ] downlist left_list: 0
> received
> Feb 08 08:12:30 h18 corosync[33967]: [CPG ] downlist left_list: 0
> received
> Feb 08 08:12:30 h18 corosync[33967]: [CPG ] downlist left_list: 0
> received
> Feb 08 08:12:30 h18 corosync[33967]: [QUORUM] This node is within
> the primary component and will provide service.
> Feb 08 08:12:30 h18 corosync[33967]: [QUORUM] Members[3]: 116 118
> 119
> Feb 08 08:12:30 h18 corosync[33967]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Feb 08 08:12:30 h18 sbd[33953]: notice: inquisitor_child: Servant
> cluster is healthy (age: 0)
>
> Broadcast message from systemd-journald@h18 (Mon 2021-02-08 08:12:32
> CET):
>
> sbd[33953]: emerg: do_exit: Rebooting system: reboot
>
> Feb 08 08:12:32 h18 sbd[33958]: /dev/disk/by-id/dm-name-SBD_1-
> 3P1: notice: servant_md: Received command reset from h16 on disk
> /dev/disk/by-id/dm-name-SBD_1-3P1
> Feb 08 08:12:32 h18 sbd[33959]: /dev/disk/by-id/dm-name-SBD_1-
> 3P2: notice: servant_md: Received command reset from h16 on disk
> /dev/disk/by-id/dm-name-SBD_1-3P2
> Feb 08 08:12:32 h18 sbd[33953]: warning: inquisitor_child:
> /dev/disk/by-id/dm-name-SBD_1-3P1 requested a reset
> Feb 08 08:12:32 h18 sbd[33953]: emerg: do_exit: Rebooting system:
> reboot
>
> OS/software:
> SLES15 SP2
> ph16:~ # rpm -q pacemaker corosync
> pacemaker-2.0.4+20200616.2deceaa3a-3.3.1.x86_64
> corosync-2.4.5-6.3.2.x86_64
>
> Regards,
> Ulrich
>
--
Ken Gaillot <kgaillot at redhat.com>