[ClusterLabs] [External] : Re: Fence Agent tests

Tue Nov 15 07:07:52 EST 2022

On Wed, Nov 9, 2022 at 2:58 PM Robert Hayden <robert.h.hayden at oracle.com>
wrote:

>
> > -----Original Message-----
> > From: Users <users-bounces at clusterlabs.org> On Behalf Of Andrei
> > Borzenkov
> > Sent: Wednesday, November 9, 2022 2:59 AM
> > To: Cluster Labs - All topics related to open-source clustering welcomed
> > <users at clusterlabs.org>
> > Subject: Re: [ClusterLabs] [External] : Re: Fence Agent tests
> >
> > On Mon, Nov 7, 2022 at 5:07 PM Robert Hayden
> > <robert.h.hayden at oracle.com> wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > > From: Users <users-bounces at clusterlabs.org> On Behalf Of Valentin Vidic via Users
> > > > Sent: Sunday, November 6, 2022 5:20 PM
> > > > To: users at clusterlabs.org
> > > > Cc: Valentin Vidić <vvidic at valentin-vidic.from.hr>
> > > > Subject: Re: [ClusterLabs] [External] : Re: Fence Agent tests
> > > >
> > > > On Sun, Nov 06, 2022 at 09:08:19PM +0000, Robert Hayden wrote:
> > > > > > When SBD_PACEMAKER was set to "yes", the lack of network connectivity
> > > > > > to the node would be seen and acted upon by the remote nodes (evicts
> > > > > > and takes over ownership of the resources).  But the impacted node
> > > > > > would just sit logging IO errors.  Pacemaker would keep updating the
> > > > > > /dev/watchdog device so SBD would not self evict.   Once I re-enabled
> > > > > > the network, then the
> > > >
> > > > Interesting, not sure if this is the expected behaviour based on:
> > > >
> > > >
> >
> https://urldefense.com/v3/__https://lists.clusterlabs.org/pipermail/users/2

Which versions of pacemaker/corosync/sbd are you using?
iirc a result of the discussion linked was sbd checking watchdog-timeout
against sync-timeout in case of qdevice being used. default sync-timeout
is 30s and your watchdog-timeout is 20s. So I would expect kind of current
sbd should refuse startup.
But iirc in the discussion linked the pacemaker-node finally became
non-quorate.
There was just a possible split-brain-gap when sync-timeout >
watchdog-timeout.
So if your pacemaker-instance stays quorate it has to be something else
rather.

>
> > > > 017-
> > > >
> > August/022699.html__;!!ACWV5N9M2RV99hQ!IvnnhGI1HtTBGTKr4VFabWA
> > > > LeMfBWNhcS0FHsPFHwwQ3Riu5R3pOYLaQPNia-
> > > > GaB38wRJ7Eq4Q3GyT5C3s8y7w$
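As a rough illustration of the timeout relationship mentioned above (this is
not sbd's actual code; the function name and the seconds-based arguments are
assumptions made only for this sketch), the check amounts to comparing the
two values:

```python
def watchdog_covers_sync(watchdog_timeout_s: float, sync_timeout_s: float) -> bool:
    """Return True when sync-timeout does not exceed watchdog-timeout.

    Hypothetical helper for illustration only. A split-brain gap is
    possible when sync-timeout > watchdog-timeout, so that case is
    reported as False.
    """
    return sync_timeout_s <= watchdog_timeout_s


# Values from this thread: watchdog-timeout 20s, default sync-timeout 30s.
print(watchdog_covers_sync(20, 30))  # prints False
```

With the values from this thread the comparison fails, which is the case
where a startup refusal would be expected.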
> > > >
> > > > Does SBD log "Majority of devices lost - surviving on pacemaker" or
> > > > some other messages related to Pacemaker?
> > >
> > > Yes.
> > >
> > > >
> > > > > Also what is the status of Pacemaker when the network is down? Does it
> > > > > report no quorum or something else?
> > > >
> > >
> > > > Pacemaker on the failing node shows quorum even though it has lost
> > > > communication to the Quorum Device and to the other node in the cluster.
> > > > The non-failing node of the cluster can see the Quorum Device system and
> > > > thus correctly determines to fence the failing node and take over its
> > > > resources.
>

Hmm ... maybe some problem with the qdevice setup and/or the quorum strategy
(LMS for instance).
If quorum doesn't work properly, your cluster won't work properly regardless
of whether sbd kills the node properly or not.

> > >
> > > > Only after I run firewall-cmd --panic-off, will the failing node start
> > > > to log messages about loss of TOTEM and getting a new consensus with the
> > > > now visible members.
> > >
> >
> > Where exactly do you use firewalld panic mode? You have hosts, you
> > have VM, you have qnode ...
> >
> > Have you verified that the network is blocked bidirectionally? I had
> > rather mixed experience with asymmetrical firewalls which resembles
> > your description.
>
> In my testing harness, I will send a script to the remote node which
> contains the firewall-cmd --panic-on, a sleep command, and then
> turn off the panic mode.  That way I can adjust the length of time
> network is unavailable on a single node.  I used to log into a network
> switch to turn ports off, but that is not possible in a Cloud environment.
> I have also played with manually creating iptables rules, but the panic
> mode is simply easier and accomplishes the task.
>
> I have verified that when panic mode is on, no inbound or outbound
> network traffic is allowed.   This includes iSCSI packets as well.  You
> better have access to the console or the ability to reset the system.
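
The outage script described above could be generated along these lines. This
is only a sketch: the helper name and the /bin/sh shebang are assumptions,
and it merely builds the script text rather than running anything, since
panic mode cuts all network access to the node.

```python
def build_outage_script(outage_seconds: int) -> str:
    """Build the text of a shell script that blocks all traffic for a while.

    Hypothetical helper for illustration: panic mode on, sleep, panic mode
    off, so the outage length is controlled by a single parameter.
    """
    if outage_seconds <= 0:
        raise ValueError("outage_seconds must be positive")
    lines = [
        "#!/bin/sh",
        "firewall-cmd --panic-on",    # drop all inbound and outbound traffic
        f"sleep {outage_seconds}",    # length of the simulated outage
        "firewall-cmd --panic-off",   # restore normal firewall behaviour
    ]
    return "\n".join(lines) + "\n"


print(build_outage_script(60))
```

Keeping the sleep and the panic-off in the same script means the node
restores its own network even though nothing can reach it in the meantime.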
>
>
> >
> > Also it may depend on the corosync driver in use.
> >
> > > > I think all of that explains the lack of self-fencing when the sbd
> > > > setting of SBD_PACEMAKER=yes is used.
>

Are you aware that when setting SBD_PACEMAKER=no with just a single
disk this disk will become a SPOF?

Klaus

> > >
> >
> > Correct. This means that at least under some conditions
> > pacemaker/corosync fail to detect isolation.
> > > _______________________________________________
> > > Manage your subscription:
> > > https://lists.clusterlabs.org/mailman/listinfo/users
> > >
> > > ClusterLabs home: https://www.clusterlabs.org/