[ClusterLabs] Noticed oddity when DC is going to be fenced
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Tue Mar 1 04:05:16 EST 2022
Hi!
On current SLES15 SP3 I noticed an oddity when the node running the DC is about to be fenced:
It seems that another node starts performing recovery operations before the old DC is confirmed to be fenced.
Like this (node 116, i.e. h16, is the DC):
Mar 01 01:33:53 h18 corosync[6754]: [TOTEM ] A new membership (172.20.16.18:45612) was formed. Members left: 116
Mar 01 01:33:53 h18 corosync[6754]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Our peer on the DC (h16) is dead
Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: State transition S_NOT_DC -> S_ELECTION
Mar 01 01:33:53 h18 dlm_controld[8544]: 394518 fence request 116 pid 16307 nodedown time 1646094833 fence_all dlm_stonith
Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: State transition S_ELECTION -> S_INTEGRATION
Mar 01 01:33:53 h18 dlm_stonith[16307]: stonith_api_time: Found 1 entries for 116/(null): 0 in progress, 0 completed
Mar 01 01:33:53 h18 pacemaker-fenced[6973]: notice: Client stonith-api.16307.4961743f wants to fence (reboot) '116' with device '(any)'
Mar 01 01:33:53 h18 pacemaker-fenced[6973]: notice: Requesting peer fencing (reboot) targeting h16
Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Cluster node h16 will be fenced: peer is no longer part of the cluster
Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Node h16 is unclean
(so far, so good)
Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Scheduling Node h16 for STONITH
Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: notice: * Fence (reboot) h16 'peer is no longer part of the cluster'
Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Initiating monitor operation prm_stonith_sbd_monitor_600000 locally on h18
Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Requesting local execution of monitor operation for prm_stonith_sbd on h18
Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Initiating stop operation prm_cron_snap_v17_stop_0 on h19
(isn't h18 playing DC already while h16 isn't fenced yet?)
Mar 01 01:35:23 h18 pacemaker-controld[6980]: error: Node h18 did not send monitor result (via controller) within 90000ms (action timeout plus cluster-delay)
Mar 01 01:35:23 h18 pacemaker-controld[6980]: error: [Action 26]: In-flight resource op prm_stonith_sbd_monitor_600000 on h18 (priority: 9900, waiting: (null))
Mar 01 01:35:23 h18 pacemaker-controld[6980]: notice: Transition 0 aborted: Action lost
Mar 01 01:35:23 h18 pacemaker-controld[6980]: warning: rsc_op 26: prm_stonith_sbd_monitor_600000 on h18 timed out
(whatever that means)
(now the fencing confirmation follows)
Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation 'reboot' [16309] (call 2 from stonith-api.16307) for host 'h16' with device 'prm_stonith_sbd' returned: 0 (OK)
Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation 'reboot' targeting h16 on h18 for stonith-api.16307 at h18.36b9a9bb: OK
Mar 01 01:35:55 h18 stonith-api[16307]: stonith_api_kick: Node 116/(null) kicked: reboot
Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation 'reboot' targeting h16 on rksaph18 for pacemaker-controld.6980 at h18.8ce2f33f (merged): OK
Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Peer h16 was terminated (reboot) by h18 on behalf of stonith-api.16307: OK
Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Stonith operation 2/1:0:0:a434124e-3e35-410d-8e17-ef9ae4e4e6eb: OK (0)
Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Peer h16 was terminated (reboot) by h18 on behalf of pacemaker-controld.6980: OK
(actual recovery happens)
Mar 01 01:35:55 h18 kernel: ocfs2: Begin replay journal (node 116, slot 0) on device (9,10)
Mar 01 01:35:55 h18 kernel: md: md10: resync done.
(more actions follow)
Mar 01 01:35:56 h18 pacemaker-schedulerd[6978]: notice: Calculated transition 1, saving inputs in /var/lib/pacemaker/pengine/pe-input-87.bz2
(actions completed)
Mar 01 01:37:18 h18 pacemaker-controld[6980]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
(pacemaker-2.0.5+20201202.ba59be712-150300.4.16.1.x86_64)
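The timing in the excerpts above can be double-checked with a few lines of Python. This is only a sketch: the event labels are names made up for illustration, the timestamps are copied from the log lines quoted above, and the year is an assumption because syslog timestamps omit it.

```python
from datetime import datetime

# Assumption: syslog lines carry no year, so take it from the date of this post.
YEAR = 2022


def parse_ts(stamp: str) -> datetime:
    """Parse a syslog-style 'Mon DD HH:MM:SS' stamp (English month names)."""
    return datetime.strptime(f"{YEAR} {stamp}", "%Y %b %d %H:%M:%S")


# Hypothetical labels for the log lines quoted above, with their timestamps.
events = {
    "fence_requested": "Mar 01 01:33:53",  # pacemaker-fenced: Requesting peer fencing
    "stop_initiated": "Mar 01 01:33:53",   # controld: Initiating stop ... on h19
    "monitor_timeout": "Mar 01 01:35:23",  # controld: did not send monitor result
    "fence_confirmed": "Mar 01 01:35:55",  # pacemaker-fenced: Operation 'reboot' ... OK
}

# Seconds elapsed since the fencing request for each event.
t0 = parse_ts(events["fence_requested"])
offsets = {
    name: (parse_ts(stamp) - t0).total_seconds()
    for name, stamp in events.items()
}

for name, seconds in offsets.items():
    print(f"{name:16s} +{seconds:.0f}s")
```

With these timestamps, the stop operation on h19 is initiated in the same second as the fencing request, the monitor timeout is logged 90 seconds later (matching the 90000ms in the error message), and the fencing confirmation arrives 122 seconds after the request.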
Did I misunderstand something, or does it look like a bug?
Regards,
Ulrich