[ClusterLabs] node fencing due to "Stonith/shutdown of node .. was not expected" while the node shut down cleanly

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Mon Feb 8 02:41:47 EST 2021


Hi!

There have been previous reports of this problem, and today I hit it again:
I restarted a node (h18, the DC) via "crm cluster restart". The node shut down cleanly (at least the shutdown ran to completion), but while it was coming back up it was fenced by the new DC (h16):

Feb 08 08:12:24 h18 pacemaker-controld[8293]:  notice: Transition 409 (Complete=23, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-4.bz2): Complete
Feb 08 08:12:24 h18 pacemaker-controld[8293]:  notice: Disconnected from the executor
Feb 08 08:12:24 h18 pacemaker-controld[8293]:  notice: Disconnected from Corosync
Feb 08 08:12:24 h18 pacemaker-controld[8293]:  notice: Disconnected from the CIB manager
Feb 08 08:12:24 h18 pacemakerd[8286]:  notice: Stopping pacemaker-schedulerd
Feb 08 08:12:24 h18 pacemaker-schedulerd[8292]:  notice: Caught 'Terminated' signal
Feb 08 08:12:24 h18 pacemakerd[8286]:  notice: Stopping pacemaker-attrd
Feb 08 08:12:24 h18 pacemaker-attrd[8291]:  notice: Caught 'Terminated' signal
Feb 08 08:12:24 h18 pacemaker-execd[8290]:  notice: Caught 'Terminated' signal
Feb 08 08:12:24 h18 pacemakerd[8286]:  notice: Stopping pacemaker-execd
Feb 08 08:12:24 h18 pacemaker-based[8287]:  warning: new_event_notification (8287-8291-11): Broken pipe (32)
Feb 08 08:12:24 h18 pacemaker-based[8287]:  warning: Notification of client attrd/b836bbe9-648e-4b1a-ae78-b351d5ffa00b failed
Feb 08 08:12:24 h18 pacemakerd[8286]:  notice: Stopping pacemaker-fenced
Feb 08 08:12:24 h18 pacemaker-fenced[8288]:  notice: Caught 'Terminated' signal
Feb 08 08:12:24 h18 pacemakerd[8286]:  notice: Stopping pacemaker-based
Feb 08 08:12:24 h18 pacemaker-based[8287]:  notice: Caught 'Terminated' signal
Feb 08 08:12:24 h18 pacemaker-based[8287]:  notice: Disconnected from Corosync
Feb 08 08:12:24 h18 pacemaker-based[8287]:  notice: Disconnected from Corosync
Feb 08 08:12:24 h18 sbd[8264]:  warning: cleanup_servant_by_pid: Servant for pcmk (pid: 8270) has terminated
Feb 08 08:12:24 h18 pacemakerd[8286]:  notice: Shutdown complete
Feb 08 08:12:24 h18 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
Feb 08 08:12:24 h18 sbd[33927]:       pcmk:   notice: servant_pcmk: Monitoring Pacemaker health
Feb 08 08:12:24 h18 systemd[1]: Stopping Shared-storage based fencing daemon...
Feb 08 08:12:24 h18 systemd[1]: Stopping Corosync Cluster Engine...
Feb 08 08:12:24 h18 sbd[8264]:  warning: cleanup_servant_by_pid: Servant for pcmk (pid: 33927) has terminated
Feb 08 08:12:24 h18 sbd[8264]:  warning: cleanup_servant_by_pid: Servant for cluster (pid: 8271) has terminated
Feb 08 08:12:24 h18 corosync[33929]: Signaling Corosync Cluster Engine (corosync) to terminate: [  OK  ]
Feb 08 08:12:24 h18 corosync[8279]:   [MAIN  ] Node was shut down by a signal
Feb 08 08:12:24 h18 corosync[8279]:   [SERV  ] Unloading all Corosync service engines.
Feb 08 08:12:24 h18 corosync[8279]:   [QB    ] withdrawing server sockets
Feb 08 08:12:24 h18 corosync[8279]:   [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Feb 08 08:12:24 h18 corosync[8279]:   [QB    ] withdrawing server sockets
Feb 08 08:12:24 h18 corosync[8279]:   [SERV  ] Service engine unloaded: corosync configuration map access
Feb 08 08:12:24 h18 corosync[8279]:   [QB    ] withdrawing server sockets
Feb 08 08:12:24 h18 corosync[8279]:   [SERV  ] Service engine unloaded: corosync configuration service
Feb 08 08:12:24 h18 corosync[8279]:   [QB    ] withdrawing server sockets
Feb 08 08:12:24 h18 corosync[8279]:   [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Feb 08 08:12:24 h18 corosync[8279]:   [QB    ] withdrawing server sockets
Feb 08 08:12:24 h18 corosync[8279]:   [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Feb 08 08:12:24 h18 corosync[8279]:   [SERV  ] Service engine unloaded: corosync profile loading service
Feb 08 08:12:24 h18 corosync[8279]:   [MAIN  ] Corosync Cluster Engine exiting normally
Feb 08 08:12:24 h18 corosync[33929]: Waiting for corosync services to unload:[  OK  ]
Feb 08 08:12:24 h18 systemd[1]: Stopped Corosync Cluster Engine.
Feb 08 08:12:24 h18 sbd[8264]:  warning: cleanup_servant_by_pid: Servant for /dev/disk/by-id/dm-name-SBD_1-3P1 (pid: 8268) has terminated
Feb 08 08:12:24 h18 sbd[8264]:  warning: cleanup_servant_by_pid: Servant for /dev/disk/by-id/dm-name-SBD_1-3P2 (pid: 8269) has terminated
Feb 08 08:12:24 h18 systemd[1]: Stopped Shared-storage based fencing daemon.

So I hope we can agree that this was a "clean" shutdown of the node.
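
(If anyone wants to double-check the "clean" part from the journal, the relevant markers can be pulled out with something like the following; just a sketch, unit names as on SLES15 SP2:

h18:~ # journalctl -u pacemaker -b | grep -E 'Shutdown complete|Stopped Pacemaker'

Both pacemakerd's "Shutdown complete" and systemd's "Stopped Pacemaker High Availability Cluster Manager" lines shown above should come back.)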

Now the new DC:
Feb 08 08:12:24 h16 pacemaker-controld[7525]:  notice: Our peer on the DC (h18) is dead
Feb 08 08:12:24 h16 pacemaker-controld[7525]:  notice: State transition S_NOT_DC -> S_ELECTION
Feb 08 08:12:24 h16 pacemaker-controld[7525]:  notice: State transition S_ELECTION -> S_INTEGRATION
Feb 08 08:12:24 h16 pacemaker-attrd[7523]:  notice: Node h18 state is now lost
Feb 08 08:12:24 h16 pacemaker-attrd[7523]:  notice: Removing all h18 attributes for peer loss
Feb 08 08:12:24 h16 pacemaker-attrd[7523]:  notice: Purged 1 peer with id=118 and/or uname=h18 from the membership cache
Feb 08 08:12:24 h16 pacemaker-fenced[7521]:  notice: Node h18 state is now lost
Feb 08 08:12:24 h16 pacemaker-fenced[7521]:  notice: Purged 1 peer with id=118 and/or uname=h18 from the membership cache
Feb 08 08:12:24 h16 pacemaker-based[7520]:  notice: Node h18 state is now lost
Feb 08 08:12:24 h16 pacemaker-based[7520]:  notice: Purged 1 peer with id=118 and/or uname=h18 from the membership cache
Feb 08 08:12:24 h16 corosync[6567]:   [TOTEM ] A new membership (172.20.16.16:42448) was formed. Members left: 118
Feb 08 08:12:24 h16 corosync[6567]:   [CPG   ] downlist left_list: 1 received
Feb 08 08:12:24 h16 corosync[6567]:   [CPG   ] downlist left_list: 1 received
Feb 08 08:12:24 h16 corosync[6567]:   [QUORUM] Members[2]: 116 119
Feb 08 08:12:24 h16 corosync[6567]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 08 08:12:24 h16 pacemakerd[7518]:  notice: Node h18 state is now lost
Feb 08 08:12:24 h16 pacemaker-controld[7525]:  notice: Node h18 state is now lost
Feb 08 08:12:24 h16 pacemaker-controld[7525]:  warning: Stonith/shutdown of node h18 was not expected
Feb 08 08:12:24 h16 kernel: dlm: closing connection to node 118
Feb 08 08:12:25 h16 pacemaker-controld[7525]:  notice: Updating quorum status to true (call=160)
Feb 08 08:12:25 h16 pacemaker-schedulerd[7524]:  notice: Watchdog will be used via SBD if fencing is required and stonith-watchdog-timeout is nonzero

Why was the shutdown not expected? h18 had "announced" its shutdown beforehand (see the query sketch after the excerpt):
Feb 08 08:11:52 h18 pacemakerd[8286]:  notice: Caught 'Terminated' signal
Feb 08 08:11:52 h18 pacemakerd[8286]:  notice: Shutting down Pacemaker
Feb 08 08:11:52 h18 pacemakerd[8286]:  notice: Stopping pacemaker-controld
Feb 08 08:11:52 h18 pacemaker-controld[8293]:  notice: Caught 'Terminated' signal
Feb 08 08:11:52 h18 pacemaker-controld[8293]:  notice: Initiating controller shutdown sequence
Feb 08 08:11:52 h18 pacemaker-controld[8293]:  notice: State transition S_IDLE -> S_POLICY_ENGINE
Feb 08 08:11:52 h18 pacemaker-attrd[8291]:  notice: Setting shutdown[h18]: (unset) -> 1612768312

So when h18 came up again, it was fenced (a way to replay the scheduler's decision is sketched after the excerpt):
Feb 08 08:12:25 h16 pacemaker-schedulerd[7524]:  warning: Scheduling Node h18 for STONITH
Feb 08 08:12:25 h16 pacemaker-schedulerd[7524]:  notice:  * Fence (reboot) h18 'resource actions are unrunnable'
Feb 08 08:12:25 h16 pacemaker-schedulerd[7524]:  warning: Calculated transition 0 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-36.bz2
Feb 08 08:12:25 h16 pacemaker-controld[7525]:  notice: Processing graph 0 (ref=pe_calc-dc-1612768345-16) derived from /var/lib/pacemaker/pengine/pe-warn-36.bz2
Feb 08 08:12:25 h16 pacemaker-controld[7525]:  notice: Requesting fencing (reboot) of node h18
Feb 08 08:12:25 h16 pacemaker-fenced[7521]:  notice: Client pacemaker-controld.7525.4c7f7600 wants to fence (reboot) 'h18' with device '(any)'
Feb 08 08:12:25 h16 pacemaker-fenced[7521]:  notice: Requesting peer fencing (reboot) targeting h18
Feb 08 08:12:30 h16 corosync[6567]:   [TOTEM ] A new membership (172.20.16.16:42456) was formed. Members joined: 118
Feb 08 08:12:30 h16 corosync[6567]:   [CPG   ] downlist left_list: 0 received
Feb 08 08:12:30 h16 corosync[6567]:   [CPG   ] downlist left_list: 0 received
Feb 08 08:12:30 h16 corosync[6567]:   [CPG   ] downlist left_list: 0 received
Feb 08 08:12:30 h16 corosync[6567]:   [QUORUM] Members[3]: 116 118 119
Feb 08 08:12:30 h16 corosync[6567]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 08 08:12:30 h16 pacemaker-controld[7525]:  notice: Node h18 state is now member
Feb 08 08:12:30 h16 pacemakerd[7518]:  notice: Node h18 state is now member
Feb 08 08:12:30 h16 pacemaker-fenced[7521]:  notice: Requesting that h16 perform 'reboot' action targeting h18
Feb 08 08:12:30 h16 pacemaker-fenced[7521]:  notice: prm_stonith_sbd is eligible to fence (reboot) h18: dynamic-list
Feb 08 08:12:37 h16 corosync[6567]:   [TOTEM ] A processor failed, forming new configuration.
Feb 08 08:12:44 h16 corosync[6567]:   [TOTEM ] A new membership (172.20.16.16:42460) was formed. Members left: 118
Feb 08 08:12:44 h16 corosync[6567]:   [TOTEM ] Failed to receive the leave message. failed: 118
Feb 08 08:12:44 h16 corosync[6567]:   [CPG   ] downlist left_list: 1 received
Feb 08 08:12:44 h16 corosync[6567]:   [CPG   ] downlist left_list: 1 received
Feb 08 08:12:44 h16 corosync[6567]:   [QUORUM] Members[2]: 116 119
Feb 08 08:12:44 h16 corosync[6567]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 08 08:12:44 h16 pacemaker-controld[7525]:  notice: Node h18 state is now lost
Feb 08 08:12:44 h16 pacemakerd[7518]:  notice: Node h18 state is now lost
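
To see exactly why the scheduler decided on fencing, the saved input named in the log can be replayed offline, e.g. (sketch; run wherever that file exists):

h16:~ # crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-warn-36.bz2

That should reproduce transition 0, including the "Fence (reboot) h18" action, and show the node state the scheduler based it on.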

Feb 08 08:12:24 h18 systemd[1]: Starting Shared-storage based fencing daemon...
Feb 08 08:12:24 h18 systemd[1]: Starting Corosync Cluster Engine...
Feb 08 08:12:24 h18 sbd[33948]:   notice: main: Do flush + write 'b' to sysrq in case of timeout
Feb 08 08:12:25 h18 sbd[33958]: /dev/disk/by-id/dm-name-SBD_1-3P1:   notice: servant_md: Monitoring slot 4 on disk /dev/disk/by-id/dm-name-SBD_1-3P1
Feb 08 08:12:25 h18 sbd[33959]: /dev/disk/by-id/dm-name-SBD_1-3P2:   notice: servant_md: Monitoring slot 4 on disk /dev/disk/by-id/dm-name-SBD_1-3P2
Feb 08 08:12:25 h18 sbd[33960]:       pcmk:   notice: servant_pcmk: Monitoring Pacemaker health
Feb 08 08:12:25 h18 sbd[33962]:    cluster:   notice: servant_cluster: Monitoring unknown cluster health
Feb 08 08:12:25 h18 corosync[33964]:   [MAIN  ] Corosync Cluster Engine ('2.4.5'): started and ready to provide service.
Feb 08 08:12:25 h18 corosync[33964]:   [MAIN  ] Corosync built-in features: testagents systemd qdevices qnetd pie relro bindnow
Feb 08 08:12:25 h18 corosync[33967]:   [TOTEM ] Initializing transport (UDP/IP Unicast).
Feb 08 08:12:25 h18 corosync[33967]:   [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Feb 08 08:12:25 h18 corosync[33967]:   [TOTEM ] Initializing transport (UDP/IP Unicast).
Feb 08 08:12:25 h18 corosync[33967]:   [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Feb 08 08:12:25 h18 corosync[33967]:   [TOTEM ] The network interface [172.20.16.18] is now up.
Feb 08 08:12:25 h18 corosync[33967]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Feb 08 08:12:25 h18 corosync[33967]:   [QB    ] server name: cmap
Feb 08 08:12:25 h18 corosync[33967]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Feb 08 08:12:25 h18 corosync[33967]:   [QB    ] server name: cfg
Feb 08 08:12:25 h18 corosync[33967]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Feb 08 08:12:25 h18 corosync[33967]:   [QB    ] server name: cpg
Feb 08 08:12:25 h18 corosync[33967]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Feb 08 08:12:25 h18 corosync[33967]:   [QUORUM] Using quorum provider corosync_votequorum
Feb 08 08:12:25 h18 corosync[33967]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Feb 08 08:12:25 h18 corosync[33967]:   [QB    ] server name: votequorum
Feb 08 08:12:25 h18 corosync[33967]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Feb 08 08:12:25 h18 corosync[33967]:   [QB    ] server name: quorum
Feb 08 08:12:25 h18 corosync[33967]:   [TOTEM ] adding new UDPU member {172.20.16.16}
Feb 08 08:12:25 h18 corosync[33967]:   [TOTEM ] adding new UDPU member {172.20.16.18}
Feb 08 08:12:25 h18 corosync[33967]:   [TOTEM ] adding new UDPU member {172.20.16.19}
Feb 08 08:12:25 h18 corosync[33967]:   [TOTEM ] The network interface [10.2.2.18] is now up.
Feb 08 08:12:25 h18 corosync[33967]:   [TOTEM ] adding new UDPU member {10.2.2.16}
Feb 08 08:12:25 h18 corosync[33967]:   [TOTEM ] adding new UDPU member {10.2.2.18}
Feb 08 08:12:25 h18 corosync[33967]:   [TOTEM ] adding new UDPU member {10.2.2.19}
Feb 08 08:12:25 h18 corosync[33967]:   [TOTEM ] A new membership (172.20.16.18:42448) was formed. Members joined: 118
Feb 08 08:12:25 h18 corosync[33967]:   [CPG   ] downlist left_list: 0 received
Feb 08 08:12:25 h18 corosync[33967]:   [QUORUM] Members[1]: 118
Feb 08 08:12:25 h18 corosync[33967]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 08 08:12:25 h18 corosync[33950]: Starting Corosync Cluster Engine (corosync): [  OK  ]
Feb 08 08:12:25 h18 systemd[1]: Started Corosync Cluster Engine.
Feb 08 08:12:30 h18 corosync[33967]:   [TOTEM ] A new membership (172.20.16.16:42456) was formed. Members joined: 116 119
Feb 08 08:12:30 h18 corosync[33967]:   [CPG   ] downlist left_list: 0 received
Feb 08 08:12:30 h18 corosync[33967]:   [CPG   ] downlist left_list: 0 received
Feb 08 08:12:30 h18 corosync[33967]:   [CPG   ] downlist left_list: 0 received
Feb 08 08:12:30 h18 corosync[33967]:   [QUORUM] This node is within the primary component and will provide service.
Feb 08 08:12:30 h18 corosync[33967]:   [QUORUM] Members[3]: 116 118 119
Feb 08 08:12:30 h18 corosync[33967]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 08 08:12:30 h18 sbd[33953]:   notice: inquisitor_child: Servant cluster is healthy (age: 0)

Broadcast message from systemd-journald at h18 (Mon 2021-02-08 08:12:32 CET):

sbd[33953]:    emerg: do_exit: Rebooting system: reboot

Feb 08 08:12:32 h18 sbd[33958]: /dev/disk/by-id/dm-name-SBD_1-3P1:   notice: servant_md: Received command reset from h16 on disk /dev/disk/by-id/dm-name-SBD_1-3P1
Feb 08 08:12:32 h18 sbd[33959]: /dev/disk/by-id/dm-name-SBD_1-3P2:   notice: servant_md: Received command reset from h16 on disk /dev/disk/by-id/dm-name-SBD_1-3P2
Feb 08 08:12:32 h18 sbd[33953]:  warning: inquisitor_child: /dev/disk/by-id/dm-name-SBD_1-3P1 requested a reset
Feb 08 08:12:32 h18 sbd[33953]:    emerg: do_exit: Rebooting system: reboot
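
For the record, the poison pill that h16 wrote can also be inspected on the SBD disks themselves, roughly like this (sketch; device names as above):

h18:~ # sbd -d /dev/disk/by-id/dm-name-SBD_1-3P1 list

which lists the message slots and the last message per slot (slot 4 being h18's according to the log).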

OS/software:
SLES15 SP2
h16:~ # rpm -q pacemaker corosync
pacemaker-2.0.4+20200616.2deceaa3a-3.3.1.x86_64
corosync-2.4.5-6.3.2.x86_64

Regards,
Ulrich



