[ClusterLabs] Noticed oddity when DC is going to be fenced

Tue Mar 1 10:04:31 EST 2022

On Tue, 2022-03-01 at 10:05 +0100, Ulrich Windl wrote:
> Hi!
> 
> For current SLES15 SP3 I noticed an oddity when the node running the
> DC is going to be fenced:
> It seems that another node is performing recovery operations while
> the old DC is not confirmed to be fenced.
> 
> Like this (116 is the DC):
> Mar 01 01:33:53 h18 corosync[6754]:   [TOTEM ] A new membership
> (172.20.16.18:45612) was formed. Members left: 116
> 
> Mar 01 01:33:53 h18 corosync[6754]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
> Mar 01 01:33:53 h18 pacemaker-controld[6980]:  notice: Our peer on
> the DC (h16) is dead
> Mar 01 01:33:53 h18 pacemaker-controld[6980]:  notice: State
> transition S_NOT_DC -> S_ELECTION

At this point, h16 loses its DC status, so the cluster has no DC.

> 
> Mar 01 01:33:53 h18 dlm_controld[8544]: 394518 fence request 116 pid
> 16307 nodedown time 1646094833 fence_all dlm_stonith
> Mar 01 01:33:53 h18 pacemaker-controld[6980]:  notice: State
> transition S_ELECTION -> S_INTEGRATION

At this point, a new DC election has completed, and h18 is now the DC
(as indicated by the scheduler messages later, since the scheduler runs
only on the DC).
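
If it helps to follow these controller state changes in a longer log, here
is a minimal sketch that pulls out the "State transition" notices. It
assumes each log entry sits on a single line (not wrapped as in the quoted
mail) and uses the message format seen in the excerpt above; the function
name and the sample list are just for illustration.

```python
import re

# Matches pacemaker-controld "State transition A -> B" notices.
# The pattern is based only on the log format quoted in this thread.
TRANSITION = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"pacemaker-controld\[\d+\]:\s+notice: State transition "
    r"(?P<old>\S+) -> (?P<new>\S+)"
)


def state_transitions(lines):
    """Return (timestamp, host, old_state, new_state) per transition line."""
    found = []
    for line in lines:
        m = TRANSITION.match(line)
        if m:
            found.append((m["ts"], m["host"], m["old"], m["new"]))
    return found


# Unwrapped copies of three entries from the quoted log.
sample = [
    "Mar 01 01:33:53 h18 pacemaker-controld[6980]:  notice: State transition S_NOT_DC -> S_ELECTION",
    "Mar 01 01:33:53 h18 dlm_controld[8544]: 394518 fence request 116 pid 16307 nodedown time 1646094833 fence_all dlm_stonith",
    "Mar 01 01:33:53 h18 pacemaker-controld[6980]:  notice: State transition S_ELECTION -> S_INTEGRATION",
]

for ts, host, old, new in state_transitions(sample):
    print(f"{ts} {host}: {old} -> {new}")
```

The S_NOT_DC -> S_ELECTION line marks where h18 noticed the DC was gone,
and S_ELECTION -> S_INTEGRATION marks where the election finished.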

> Mar 01 01:33:53 h18 dlm_stonith[16307]: stonith_api_time: Found 1
> entries for 116/(null): 0 in progress, 0 completed
> Mar 01 01:33:53 h18 pacemaker-fenced[6973]:  notice: Client stonith-
> api.16307.4961743f wants to fence (reboot) '116' with device '(any)'
> Mar 01 01:33:53 h18 pacemaker-fenced[6973]:  notice: Requesting peer
> fencing (reboot) targeting h16
> 
> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]:  warning: Cluster
> node h16 will be fenced: peer is no longer part of the cluster
> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]:  warning: Node h16 is
> unclean
> 
> (so far, so good)
> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]:  warning: Scheduling
> Node h16 for STONITH
> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]:  notice:  * Fence
> (reboot) h16 'peer is no longer part of the cluster'
> 
> Mar 01 01:33:53 h18 pacemaker-controld[6980]:  notice: Initiating
> monitor operation prm_stonith_sbd_monitor_600000 locally on h18
> Mar 01 01:33:53 h18 pacemaker-controld[6980]:  notice: Requesting
> local execution of monitor operation for prm_stonith_sbd on h18
> Mar 01 01:33:53 h18 pacemaker-controld[6980]:  notice: Initiating
> stop operation prm_cron_snap_v17_stop_0 on h19
> (isn't h18 playing DC already while h16 isn't fenced yet?)
> 
> Mar 01 01:35:23 h18 pacemaker-controld[6980]:  error: Node h18 did
> not send monitor result (via controller) within 90000ms (action
> timeout plus cluster-delay)
> Mar 01 01:35:23 h18 pacemaker-controld[6980]:  error: [Action   26]:
> In-flight resource op prm_stonith_sbd_monitor_600000 on h18
> (priority: 9900, waiting: (null))
> Mar 01 01:35:23 h18 pacemaker-controld[6980]:  notice: Transition 0
> aborted: Action lost
> Mar 01 01:35:23 h18 pacemaker-controld[6980]:  warning: rsc_op 26:
> prm_stonith_sbd_monitor_600000 on h18 timed out
> (whatever that means)
> 
> (now the fencing confirmation follows)
> Mar 01 01:35:55 h18 pacemaker-fenced[6973]:  notice: Operation
> 'reboot' [16309] (call 2 from stonith-api.16307) for host 'h16' with
> device 'prm_stonith_sbd' returned: 0 (OK)
> Mar 01 01:35:55 h18 pacemaker-fenced[6973]:  notice: Operation
> 'reboot' targeting h16 on h18 for stonith-api.16307 at h18.36b9a9bb: OK
> Mar 01 01:35:55 h18 stonith-api[16307]: stonith_api_kick: Node
> 116/(null) kicked: reboot
> Mar 01 01:35:55 h18 pacemaker-fenced[6973]:  notice: Operation
> 'reboot' targeting h16 on rksaph18 for 
> pacemaker-controld.6980 at h18.8ce2f33f (merged): OK
> Mar 01 01:35:55 h18 pacemaker-controld[6980]:  notice: Peer h16 was
> terminated (reboot) by h18 on behalf of stonith-api.16307: OK
> Mar 01 01:35:55 h18 pacemaker-controld[6980]:  notice: Stonith
> operation 2/1:0:0:a434124e-3e35-410d-8e17-ef9ae4e4e6eb: OK (0)
> Mar 01 01:35:55 h18 pacemaker-controld[6980]:  notice: Peer h16 was
> terminated (reboot) by h18 on behalf of pacemaker-controld.6980: OK
> 
> (actual recovery happens)
> Mar 01 01:35:55 h18 kernel: ocfs2: Begin replay journal (node 116,
> slot 0) on device (9,10)
> 
> Mar 01 01:35:55 h18 kernel: md: md10: resync done.
> 
> (more actions follow)
> Mar 01 01:35:56 h18 pacemaker-schedulerd[6978]:  notice: Calculated
> transition 1, saving inputs in /var/lib/pacemaker/pengine/pe-input-
> 87.bz2
> 
> (actions completed)
> Mar 01 01:37:18 h18 pacemaker-controld[6980]:  notice: State
> transition S_TRANSITION_ENGINE -> S_IDLE
> 
> (pacemaker-2.0.5+20201202.ba59be712-150300.4.16.1.x86_64)
> 
> Did I misunderstand something, or does it look like a bug?
> 
> Regards,
> Ulrich
> 
-- 
Ken Gaillot <kgaillot at redhat.com>