[ClusterLabs] [External] : Re: Fence Agent tests

Robert Hayden robert.h.hayden at oracle.com
Sat Nov 5 16:54:55 EDT 2022


> -----Original Message-----
> From: Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
> Sent: Saturday, November 5, 2022 3:45 PM
> To: users at clusterlabs.org
> Cc: Robert Hayden <robert.h.hayden at oracle.com>
> Subject: Re: [ClusterLabs] [External] : Re: Fence Agent tests
> 
> On Sat, 5 Nov 2022 20:53:09 +0100
> Valentin Vidić via Users <users at clusterlabs.org> wrote:
> 
> > On Sat, Nov 05, 2022 at 06:47:59PM +0000, Robert Hayden wrote:
> > > That was my impression as well...so I may have something wrong.  My
> > > expectation was that SBD daemon should be writing to the
> /dev/watchdog
> > > within 20 seconds and the kernel watchdog would self fence.
> >
> > I don't see anything unusual in the config except that pacemaker mode is
> > also enabled. This means that the cluster is providing signal for sbd even
> > when the storage device is down, for example:
> >
> > 883 ?        SL     0:00 sbd: inquisitor
> > 892 ?        SL     0:00  \_ sbd: watcher: /dev/vdb1 - slot: 0 - uuid: ...
> > 893 ?        SL     0:00  \_ sbd: watcher: Pacemaker
> > 894 ?        SL     0:00  \_ sbd: watcher: Cluster
> >
> > You can strace different sbd processes to see what they are doing at any
> > point.
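The strace suggestion above can be sketched as a small helper; this is an assumption about how one might script it, not something from the thread. Given `ps` output like the listing, it prints one strace invocation per sbd process so the inquisitor and each watcher can be inspected individually (attaching requires root):

```shell
# Hedged sketch: turn `ps` output into one strace command per sbd process.
# The [s] trick keeps the pipeline itself from matching.
sbd_strace_cmds() {
    awk '/[s]bd: /{print "strace -tt -p " $1}'
}
```

Usage would be something like `ps -eo pid,args | sbd_strace_cmds`, then running the printed commands; a healthy inquisitor should show periodic writes to the watchdog device.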
> 
> I suspect both watchers should detect the loss of network/communication with
> the other node.
> 
> BUT, when sbd is in Pacemaker mode, it doesn't reset the node if the
> local **Pacemaker** is still quorate (via corosync). See the full chapter:
> «If Pacemaker integration is activated, SBD will not self-fence if **device**
> majority is lost [...]»
> https://documentation.suse.com/sle-ha/15-SP4/html/SLE-HA-all/cha-ha-storage-protect.html
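For reference, Pacemaker integration is controlled from the sbd sysconfig file; the fragment below is only illustrative (the device path and the 20-second timeout are assumptions based on earlier messages in the thread, not confirmed settings):

```shell
# Illustrative /etc/sysconfig/sbd fragment (values assumed, not from the thread)
SBD_DEVICE=/dev/vdb1            # shared disk watched by the disk watcher
SBD_PACEMAKER=yes               # enables the Pacemaker watcher discussed above
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=20         # kernel self-fences if sbd stops feeding this
```

With SBD_PACEMAKER=yes, a quorate local Pacemaker can keep the watchdog fed even after the disk watcher loses the device, which matches the behavior described in the SUSE chapter.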
> 
> Would it be possible that no node is shutting down because the cluster is in
> two-node mode? Because of this mode, both would keep the quorum, expecting
> the fencing to kill the other one... Except there's no active fencing here,
> only "self-fencing".
> 

I failed to mention that I also have a quorum device set up to add its vote to
the quorum, so two_node is not enabled.  I suspect Valentin was onto something
with Pacemaker keeping the watchdog device updated because it thinks the
cluster is healthy.  I need to research and test that theory; I will try to
carve out some time next week for that.
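For context, a quorum device like the one mentioned is declared in the corosync.conf quorum section roughly as below; the host name and algorithm here are placeholders, not details from this setup:

```
quorum {
    provider: corosync_votequorum
    device {
        model: net
        votes: 1
        net {
            host: qnetd.example.com   # hypothetical qnetd server
            algorithm: ffsplit
        }
    }
}
```

With the qdevice vote present, two_node is unnecessary, and a node cut off from both its peer and the qnetd server should lose quorum.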

Appreciate all of the feedback.  I have been dealing with Cluster Suite for
over a decade, but mostly focused on my company's setup.  I still have lots
to learn, which keeps me interested.

> To verify this guess, check the corosync conf for the "two_node" parameter
> and whether both nodes still report as quorate during the network outage,
> using:
> 
>   corosync-quorumtool -s
> 
> If this turns out to be a good guess, then without **active** fencing, I
> suppose a cluster cannot rely on the two-node mode. I'm not sure what the
> best setup would be, though.
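The first half of that check can be scripted; this is a minimal sketch under the assumption that corosync.conf lives in its usual location (quorum state still needs `corosync-quorumtool -s` on each node during the outage):

```shell
# Hedged sketch: report whether two_node is set in a corosync.conf.
# Takes an optional conf path; defaults to the standard location.
check_two_node() {
    if grep -qE '^[[:space:]]*two_node[[:space:]]*:[[:space:]]*1' \
            "${1:-/etc/corosync/corosync.conf}"; then
        echo "two_node enabled"
    else
        echo "two_node not set"
    fi
}
```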
> 
> Regards,

