<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Mar 1, 2022 at 10:05 AM Ulrich Windl &lt;<a href="mailto:Ulrich.Windl@rz.uni-regensburg.de">Ulrich.Windl@rz.uni-regensburg.de</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi!<br>

<br>

For current SLES15 SP3 I noticed an oddity when the node running the DC is going to be fenced:<br>

It seems that another node is performing recovery operations while the old DC is not confirmed to be fenced.<br>

<br>

Like this (116 is the DC):<br>

Mar 01 01:33:53 h18 corosync[6754]:   [TOTEM ] A new membership (<a href="http://172.20.16.18:45612" rel="noreferrer" target="_blank">172.20.16.18:45612</a>) was formed. Members left: 116<br>

<br>

Mar 01 01:33:53 h18 corosync[6754]:   [MAIN  ] Completed service synchronization, ready to provide service.<br>

Mar 01 01:33:53 h18 pacemaker-controld[6980]:  notice: Our peer on the DC (h16) is dead<br>

Mar 01 01:33:53 h18 pacemaker-controld[6980]:  notice: State transition S_NOT_DC -> S_ELECTION<br>

<br>

Mar 01 01:33:53 h18 dlm_controld[8544]: 394518 fence request 116 pid 16307 nodedown time 1646094833 fence_all dlm_stonith<br>

Mar 01 01:33:53 h18 pacemaker-controld[6980]:  notice: State transition S_ELECTION -> S_INTEGRATION<br>

Mar 01 01:33:53 h18 dlm_stonith[16307]: stonith_api_time: Found 1 entries for 116/(null): 0 in progress, 0 completed<br>

Mar 01 01:33:53 h18 pacemaker-fenced[6973]:  notice: Client stonith-api.16307.4961743f wants to fence (reboot) '116' with device '(any)'<br>

Mar 01 01:33:53 h18 pacemaker-fenced[6973]:  notice: Requesting peer fencing (reboot) targeting h16<br>

<br>

Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]:  warning: Cluster node h16 will be fenced: peer is no longer part of the cluster<br>

Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]:  warning: Node h16 is unclean<br>

<br>

(so far, so good)<br>

Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]:  warning: Scheduling Node h16 for STONITH<br>

Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]:  notice:  * Fence (reboot) h16 'peer is no longer part of the cluster'<br>

<br>

Mar 01 01:33:53 h18 pacemaker-controld[6980]:  notice: Initiating monitor operation prm_stonith_sbd_monitor_600000 locally on h18<br>

Mar 01 01:33:53 h18 pacemaker-controld[6980]:  notice: Requesting local execution of monitor operation for prm_stonith_sbd on h18<br>

Mar 01 01:33:53 h18 pacemaker-controld[6980]:  notice: Initiating stop operation prm_cron_snap_v17_stop_0 on h19<br>

(isn't h18 playing DC already while h16 isn't fenced yet?)<br></blockquote><div><br></div><div>Periodic monitors should happen autonomously.</div><div>As long as you don't see pacemaker-schedulerd on h18 calculate a new transition that recovers the resources before fencing is confirmed,</div><div>everything should be fine.</div><div>And yes, to a certain extent h18 is playing DC (it has been elected as the new DC): somebody has to schedule the fencing.</div><div><br></div><div>Klaus</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

Mar 01 01:35:23 h18 pacemaker-controld[6980]:  error: Node h18 did not send monitor result (via controller) within 90000ms (action timeout plus cluster-delay)<br>

Mar 01 01:35:23 h18 pacemaker-controld[6980]:  error: [Action   26]: In-flight resource op prm_stonith_sbd_monitor_600000 on h18 (priority: 9900, waiting: (null))<br>

Mar 01 01:35:23 h18 pacemaker-controld[6980]:  notice: Transition 0 aborted: Action lost<br>

Mar 01 01:35:23 h18 pacemaker-controld[6980]:  warning: rsc_op 26: prm_stonith_sbd_monitor_600000 on h18 timed out<br>

(whatever that means)<br>

<br>

(now the fencing confirmation follows)<br>

Mar 01 01:35:55 h18 pacemaker-fenced[6973]:  notice: Operation 'reboot' [16309] (call 2 from stonith-api.16307) for host 'h16' with device 'prm_stonith_sbd' returned: 0 (OK)<br>

Mar 01 01:35:55 h18 pacemaker-fenced[6973]:  notice: Operation 'reboot' targeting h16 on h18 for stonith-api.16307@h18.36b9a9bb: OK<br>

Mar 01 01:35:55 h18 stonith-api[16307]: stonith_api_kick: Node 116/(null) kicked: reboot<br>

Mar 01 01:35:55 h18 pacemaker-fenced[6973]:  notice: Operation 'reboot' targeting h16 on rksaph18 for pacemaker-controld.6980@h18.8ce2f33f (merged): OK<br>

Mar 01 01:35:55 h18 pacemaker-controld[6980]:  notice: Peer h16 was terminated (reboot) by h18 on behalf of stonith-api.16307: OK<br>

Mar 01 01:35:55 h18 pacemaker-controld[6980]:  notice: Stonith operation 2/1:0:0:a434124e-3e35-410d-8e17-ef9ae4e4e6eb: OK (0)<br>

Mar 01 01:35:55 h18 pacemaker-controld[6980]:  notice: Peer h16 was terminated (reboot) by h18 on behalf of pacemaker-controld.6980: OK<br>

<br>

(actual recovery happens)<br>

Mar 01 01:35:55 h18 kernel: ocfs2: Begin replay journal (node 116, slot 0) on device (9,10)<br>

<br>

Mar 01 01:35:55 h18 kernel: md: md10: resync done.<br>

<br>

(more actions follow)<br>

Mar 01 01:35:56 h18 pacemaker-schedulerd[6978]:  notice: Calculated transition 1, saving inputs in /var/lib/pacemaker/pengine/pe-input-87.bz2<br>

<br>

(actions completed)<br>

Mar 01 01:37:18 h18 pacemaker-controld[6980]:  notice: State transition S_TRANSITION_ENGINE -> S_IDLE<br>

<br>

(pacemaker-2.0.5+20201202.ba59be712-150300.4.16.1.x86_64)<br>

<br>

Did I misunderstand something, or does it look like a bug?<br>

<br>

Regards,<br>

Ulrich<br>

<br>

<br>

</blockquote></div></div>
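<div><br></div><div>The ordering discussed in the reply (no new transition calculated on the new DC until fencing of the old DC is confirmed) can be checked against a log mechanically. The following is only a minimal sketch in Python: the function name <code>split_transitions</code> and the two regular expressions are assumptions for illustration, with the patterns copied from the message wording in the quoted pacemaker-2.0.5 log, so other versions may need different patterns. It relies on the log lines being in chronological order and does not interpret timestamps. The transition that schedules the fence itself is expected to appear before the confirmation, so an entry in the first list is not automatically a problem.</div>

```python
import re

# Patterns copied from the messages in the quoted log (assumption: other
# Pacemaker versions may word these messages differently).
FENCED = re.compile(
    r"pacemaker-controld\[\d+\]:\s+notice: Peer (\S+) was terminated "
    r"\(\w+\) by \S+ .*: OK"
)
CALCULATED = re.compile(
    r"pacemaker-schedulerd\[\d+\]:\s+notice: Calculated transition (\d+),"
)


def split_transitions(lines, node):
    """Return (before, after): the transition numbers calculated before and
    after the first fencing confirmation for `node`, in log order."""
    before, after = [], []
    confirmed = False
    for line in lines:
        m = FENCED.search(line)
        if m and m.group(1) == node:
            confirmed = True
            continue
        m = CALCULATED.search(line)
        if m:
            (after if confirmed else before).append(int(m.group(1)))
    return before, after


# Three lines taken verbatim from the quoted log.
EXCERPT = [
    "Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]:  notice:  * Fence (reboot) h16 'peer is no longer part of the cluster'",
    "Mar 01 01:35:55 h18 pacemaker-controld[6980]:  notice: Peer h16 was terminated (reboot) by h18 on behalf of stonith-api.16307: OK",
    "Mar 01 01:35:56 h18 pacemaker-schedulerd[6978]:  notice: Calculated transition 1, saving inputs in /var/lib/pacemaker/pengine/pe-input-87.bz2",
]

before, after = split_transitions(EXCERPT, "h16")
print("calculated before fencing confirmation:", before)
print("calculated after fencing confirmation:", after)
```

<div>For the excerpt, transition 1 lands in the second list, which matches the quoted log: it is calculated at 01:35:56, one second after the fencing confirmation at 01:35:55.</div>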