<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Nov 9, 2022 at 2:58 PM Robert Hayden <<a href="mailto:robert.h.hayden@oracle.com">robert.h.hayden@oracle.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

> -----Original Message-----<br>

> From: Users <<a href="mailto:users-bounces@clusterlabs.org" target="_blank">users-bounces@clusterlabs.org</a>> On Behalf Of Andrei<br>

> Borzenkov<br>

> Sent: Wednesday, November 9, 2022 2:59 AM<br>

> To: Cluster Labs - All topics related to open-source clustering welcomed<br>

> <<a href="mailto:users@clusterlabs.org" target="_blank">users@clusterlabs.org</a>><br>

> Subject: Re: [ClusterLabs] [External] : Re: Fence Agent tests<br>

> <br>

> On Mon, Nov 7, 2022 at 5:07 PM Robert Hayden<br>

> <<a href="mailto:robert.h.hayden@oracle.com" target="_blank">robert.h.hayden@oracle.com</a>> wrote:<br>

> ><br>

> ><br>

> > > -----Original Message-----<br>

> > > From: Users <<a href="mailto:users-bounces@clusterlabs.org" target="_blank">users-bounces@clusterlabs.org</a>> On Behalf Of Valentin<br>

> Vidic<br>

> > > via Users<br>

> > > Sent: Sunday, November 6, 2022 5:20 PM<br>

> > > To: <a href="mailto:users@clusterlabs.org" target="_blank">users@clusterlabs.org</a><br>

> > > Cc: Valentin Vidić <<a href="mailto:vvidic@valentin-vidic.from.hr" target="_blank">vvidic@valentin-vidic.from.hr</a>><br>

> > > Subject: Re: [ClusterLabs] [External] : Re: Fence Agent tests<br>

> > ><br>

> > > On Sun, Nov 06, 2022 at 09:08:19PM +0000, Robert Hayden wrote:<br>

> > > > When SBD_PACEMAKER was set to "yes", the lack of network<br>

> connectivity<br>

> > > to the node<br>

> > > > would be seen and acted upon by the remote nodes (evicts and takes<br>

> > > > over ownership of the resources).  But the impacted node would just<br>

> > > > sit logging IO errors.  Pacemaker would keep updating the<br>

> /dev/watchdog<br>

> > > > device so SBD would not self evict.   Once I re-enabled the network,<br>

> then<br>

> > > the<br>

> > ><br>

> > > Interesting, not sure if this is the expected behaviour based on:<br>

> > ><br>

> > ><br>

> > > <a href="https://urldefense.com/v3/__https://lists.clusterlabs.org/pipermail/users/2017-August/022699.html__;!!ACWV5N9M2RV99hQ!IvnnhGI1HtTBGTKr4VFabWALeMfBWNhcS0FHsPFHwwQ3Riu5R3pOYLaQPNia-GaB38wRJ7Eq4Q3GyT5C3s8y7w$" rel="noreferrer" target="_blank">https://urldefense.com/v3/__https://lists.clusterlabs.org/pipermail/users/2017-August/022699.html__;!!ACWV5N9M2RV99hQ!IvnnhGI1HtTBGTKr4VFabWALeMfBWNhcS0FHsPFHwwQ3Riu5R3pOYLaQPNia-GaB38wRJ7Eq4Q3GyT5C3s8y7w$</a><br></blockquote><div><br></div><div>Which versions of pacemaker, corosync and sbd are you using?</div><div>IIRC, one result of the linked discussion was that sbd now checks</div><div>watchdog-timeout against sync-timeout when a qdevice is in use. The default</div><div>sync-timeout is 30s and your watchdog-timeout is 20s, so I would expect a</div><div>reasonably current sbd to refuse to start.</div><div>But IIRC, in the linked discussion the pacemaker node did eventually become</div><div>non-quorate; there was just a possible split-brain gap when</div><div>sync-timeout &gt; watchdog-timeout.</div><div>So if your pacemaker instance stays quorate, it has to be something else.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

> > ><br>

> > > Does SBD log "Majority of devices lost - surviving on pacemaker" or<br>

> > > some other messages related to Pacemaker?<br>

> ><br>

> > Yes.<br>

> ><br>

> > ><br>

> > > Also what is the status of Pacemaker when the network is down? Does it<br>

> > > report no quorum or something else?<br>

> > ><br>

> ><br>

> > Pacemaker on the failing node shows quorum even though it has lost<br>

> > communication to the Quorum Device and to the other node in the cluster.<br>

> > The non-failing node of the cluster can see the Quorum Device system and<br>

> > thus correctly determines to fence the failing node and take over its<br>

> > resources.<br></blockquote><div><br></div><div>Hmm ... maybe there is some problem with the qdevice setup and/or the quorum</div><div>strategy (LMS, for instance).</div><div>If quorum doesn't work properly, your cluster won't work properly, regardless</div><div>of whether sbd kills the node properly or not.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

> ><br>

> > Only after I run firewall-cmd --panic-off will the failing node start to log<br>

> > messages about loss of TOTEM and getting a new consensus with the<br>

> > now visible members.<br>

> ><br>

> <br>

> Where exactly do you use firewalld panic mode? You have hosts, you<br>

> have VM, you have qnode ...<br>

> <br>

> Have you verified that the network is blocked bidirectionally? I have had<br>

> rather mixed experiences with asymmetrical firewalls, which resemble<br>

> your description.<br>

<br>

In my test harness, I send a script to the remote node that runs<br>

firewall-cmd --panic-on, sleeps, and then turns panic mode off.<br>

That way I can adjust how long the network is unavailable on a<br>

single node.  I used to log into a network switch to turn ports off,<br>

but that is not possible in a cloud environment.  I have also played<br>

with manually creating iptables rules, but panic mode is simply<br>

easier and accomplishes the task.<br>

<br>

I have verified that when panic mode is on, no inbound or outbound<br>

network traffic is allowed.  This includes iSCSI packets as well.  You had<br>

better have access to the console or the ability to reset the system.<br>

<br>

<br>

> <br>

> Also it may depend on the corosync driver in use.<br>

> <br>

> > I think all of that explains the lack of self-fencing when the sbd setting of<br>

> > SBD_PACEMAKER=yes is used.<br></blockquote><div><br></div><div>Are you aware that with SBD_PACEMAKER=no and just a single disk,</div><div>that disk becomes a SPOF?</div><div><br></div><div>Klaus</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

> ><br>

> <br>

> Correct. This means that at least under some conditions<br>

> pacemaker/corosync fail to detect isolation.<br>

> _______________________________________________<br>

> Manage your subscription:<br>

> <a href="https://urldefense.com/v3/__https://lists.clusterlabs.org/mailman/listinfo/users__;!!ACWV5N9M2RV99hQ!IMFB2Teli90q80SZ0fS4861iqEF-yFGiPUvE81iTEJM4MHWMqoPOAxaJL5Fwmyr8py4S4QRvU4INEiY6YXvIH5c$" rel="noreferrer" target="_blank">https://urldefense.com/v3/__https://lists.clusterlabs.org/mailman/listinfo/users__;!!ACWV5N9M2RV99hQ!IMFB2Teli90q80SZ0fS4861iqEF-yFGiPUvE81iTEJM4MHWMqoPOAxaJL5Fwmyr8py4S4QRvU4INEiY6YXvIH5c$</a><br>

> <br>

> ClusterLabs home:<br>

> <a href="https://urldefense.com/v3/__https://www.clusterlabs.org/__;!!ACWV5N9M2RV99hQ!IMFB2Teli90q80SZ0fS4861iqEF-yFGiPUvE81iTEJM4MHWMqoPOAxaJL5Fwmyr8py4S4QRvU4INEiY6sVTZv74$" rel="noreferrer" target="_blank">https://urldefense.com/v3/__https://www.clusterlabs.org/__;!!ACWV5N9M2RV99hQ!IMFB2Teli90q80SZ0fS4861iqEF-yFGiPUvE81iTEJM4MHWMqoPOAxaJL5Fwmyr8py4S4QRvU4INEiY6sVTZv74$</a><br>


</blockquote></div></div>