<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Nov 5, 2022 at 9:45 PM Jehan-Guillaume de Rorthais via Users <<a href="mailto:users@clusterlabs.org">users@clusterlabs.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Sat, 5 Nov 2022 20:53:09 +0100<br>

Valentin Vidić via Users <<a href="mailto:users@clusterlabs.org" target="_blank">users@clusterlabs.org</a>> wrote:<br>

<br>

> On Sat, Nov 05, 2022 at 06:47:59PM +0000, Robert Hayden wrote:<br>

> > That was my impression as well...so I may have something wrong.  My<br>

> > expectation was that SBD daemon should be writing to the /dev/watchdog<br>

> > within 20 seconds and the kernel watchdog would self fence.  <br>

> <br>

> I don't see anything unusual in the config except that pacemaker mode is<br>

> also enabled. This means that the cluster is providing signal for sbd even<br>

> when the storage device is down, for example:<br>

> <br>

> 883 ?        SL     0:00 sbd: inquisitor<br>

> 892 ?        SL     0:00  \_ sbd: watcher: /dev/vdb1 - slot: 0 - uuid: ...<br>

> 893 ?        SL     0:00  \_ sbd: watcher: Pacemaker<br>

> 894 ?        SL     0:00  \_ sbd: watcher: Cluster<br>

> <br>

> You can strace different sbd processes to see what they are doing at any<br>

> point.<br>

<br>

I suspect both watchers should detect the loss of network/communication with<br>

the other node.<br>

<br>

BUT, when sbd is in Pacemaker mode, it doesn't reset the node if the<br>

local **Pacemaker** is still quorate (via corosync). See the full chapter:<br>

«If Pacemaker integration is activated, SBD will not self-fence if **device**<br>

majority is lost [...]»<br>

<a href="https://documentation.suse.com/sle-ha/15-SP4/html/SLE-HA-all/cha-ha-storage-protect.html" rel="noreferrer" target="_blank">https://documentation.suse.com/sle-ha/15-SP4/html/SLE-HA-all/cha-ha-storage-protect.html</a><br>

<br>

Would it be possible that no node is shutting down because the cluster is in<br>

two-node mode? Because of this mode, both would keep the quorum expecting the<br>

fencing to kill the other one... Except there's no active fencing here, only<br>

"self-fencing".<br></blockquote><div><br></div><div>That does not seem to be the case here, but for completeness:</div><div>sbd should recognize two-node mode automatically (upstream since some time</div><div>in 2017, iirc); instead of checking quorum, sbd then checks that both</div><div>nodes are present in the cpg group. I hope corosync prevents 2-node and qdevice </div><div>from being set at the same time. But even in that case I would expect spurious</div><div>self-fencing rather than the opposite.</div><div><br></div><div>Klaus</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

To verify this guess, check corosync.conf for the "two_node" parameter, and check<br>

whether both nodes still report as quorate during the network outage, using:<br>

<br>

  corosync-quorumtool -s<br>

<br>

If this turns out to be a good guess, then without **active** fencing I suppose a cluster<br>

cannot rely on two-node mode. I'm not sure what the best setup would be,<br>

though.<br>

<br>

Regards,<br>


</blockquote></div></div>