[ClusterLabs] [External] : Re: Fence Agent tests

Jehan-Guillaume de Rorthais jgdr at dalibo.com
Sat Nov 5 17:17:56 EDT 2022


On Sat, 5 Nov 2022 20:54:55 +0000
Robert Hayden <robert.h.hayden at oracle.com> wrote:

> > -----Original Message-----
> > From: Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
> > Sent: Saturday, November 5, 2022 3:45 PM
> > To: users at clusterlabs.org
> > Cc: Robert Hayden <robert.h.hayden at oracle.com>
> > Subject: Re: [ClusterLabs] [External] : Re: Fence Agent tests
> > 
> > On Sat, 5 Nov 2022 20:53:09 +0100
> > Valentin Vidić via Users <users at clusterlabs.org> wrote:
> >   
> > > On Sat, Nov 05, 2022 at 06:47:59PM +0000, Robert Hayden wrote:  
> > > > That was my impression as well... so I may have something wrong.  My
> > > > expectation was that the SBD daemon should be writing to /dev/watchdog
> > > > within 20 seconds and the kernel watchdog would self-fence.
> > >
> > > I don't see anything unusual in the config except that pacemaker mode is
> > > also enabled. This means that the cluster is providing a signal to sbd even
> > > when the storage device is down, for example:
> > >
> > > 883 ?        SL     0:00 sbd: inquisitor
> > > 892 ?        SL     0:00  \_ sbd: watcher: /dev/vdb1 - slot: 0 - uuid: ...
> > > 893 ?        SL     0:00  \_ sbd: watcher: Pacemaker
> > > 894 ?        SL     0:00  \_ sbd: watcher: Cluster
> > >
> > > You can strace different sbd processes to see what they are doing at any
> > > point.  
> > 
> > I suspect both watchers should detect the loss of network/communication
> > with the other node.
> > 
> > BUT, when sbd is in Pacemaker mode, it doesn't reset the node if the
> > local **Pacemaker** is still quorate (via corosync). See the full chapter:
> > «If Pacemaker integration is activated, SBD will not self-fence if
> > **device** majority is lost [...]»
> > https://documentation.suse.com/sle-ha/15-SP4/html/SLE-HA-all/cha-ha-storage-protect.html
> > 
> > Would it be possible that no node is shutting down because the cluster is
> > in two-node mode? Because of this mode, both nodes would keep the quorum,
> > expecting the fencing to kill the other one... Except there's no active
> > fencing here, only "self-fencing".
> >   
> 
> I failed to mention that I also have a Quorum Device set up to add its vote
> to the quorum. So two_node is not enabled.

oh, ok.
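
Just to double-check that the qdevice vote is really accounted for, you can
look at the votequorum status; assuming recent corosync tooling, the output
should show something like:

  # corosync-quorumtool -s
  ...
  Expected votes:   3
  Total votes:      3
  Quorum:           2
  Flags:            Quorate Qdevice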

> I suspect Valentin was on to something with pacemaker keeping the watchdog
> device updated as it thinks the cluster is ok.  Need to research and test
> that theory out.  I will try to carve out some time next week for that.

AFAIK, Pacemaker relies strictly on SBD to deal with the watchdog. It doesn't
feed the watchdog itself.
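
You can verify that on a running node: the watchdog device should be held open
by the sbd inquisitor, not by any Pacemaker daemon. Something like (the PID
here is just an example):

  # lsof /dev/watchdog
  COMMAND PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
  sbd     883 root    4w   CHR  10,130      0t0  395 /dev/watchdog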

In Pacemaker mode, SBD watches the two most important parts of the cluster,
Pacemaker and Corosync:

* the "Pacemaker watcher" of SBD connects to the CIB and checks that it is
  still updated on a regular basis and that the local node is marked online.
* the "Cluster watchers" all connect with each other using a dedicated
  communication group in the corosync ring(s).

Both watchers can report a failure to SBD, which would then self-fence the node.
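
For reference, this behaviour is driven by the sbd configuration, usually
/etc/sysconfig/sbd (or /etc/default/sbd on Debian). A minimal sketch, with
values to adapt to your setup:

  SBD_DEVICE="/dev/vdb1"            # shared disk(s) watched by the disk watcher
  SBD_PACEMAKER="yes"               # enable the Pacemaker/Cluster watchers
  SBD_WATCHDOG_DEV="/dev/watchdog"  # kernel watchdog fed by the inquisitor
  SBD_WATCHDOG_TIMEOUT="5"          # watchdog fires if sbd stops feeding it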

If the network is down, I suppose the cluster watcher should complain. But I
suspect Pacemaker somehow keeps reporting itself as quorate, thus forbidding
SBD from killing the whole node...
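
A quick way to test that theory: while the network is cut, check whether the
local Pacemaker still claims quorum, and watch the sbd logs (assuming sbd runs
under systemd as sbd.service):

  # crm_node -q          # prints 1 if this partition has quorum, 0 otherwise
  # journalctl -u sbd -f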

> Appreciate all of the feedback.  I have been dealing with Cluster Suite for a
> decade+ but focused on the company's setup.  I still have lots to learn,
> which keeps me interested.

+1

Keep us informed!

Regards,

