[ClusterLabs] [External] : Re: Fence Agent tests

Jehan-Guillaume de Rorthais jgdr at dalibo.com
Mon Nov 7 09:15:59 EST 2022


On Mon, 7 Nov 2022 14:06:51 +0000
Robert Hayden <robert.h.hayden at oracle.com> wrote:

> > -----Original Message-----
> > From: Users <users-bounces at clusterlabs.org> On Behalf Of Valentin Vidic
> > via Users
> > Sent: Sunday, November 6, 2022 5:20 PM
> > To: users at clusterlabs.org
> > Cc: Valentin Vidić <vvidic at valentin-vidic.from.hr>
> > Subject: Re: [ClusterLabs] [External] : Re: Fence Agent tests
> > 
> > On Sun, Nov 06, 2022 at 09:08:19PM +0000, Robert Hayden wrote:  
> > > When SBD_PACEMAKER was set to "yes", the lack of network connectivity
> > > to the node would be seen and acted upon by the remote nodes (evicts and
> > > takes over ownership of the resources).  But the impacted node would just
> > > sit logging IO errors.  Pacemaker would keep updating the /dev/watchdog
> > > device so SBD would not self evict.  Once I re-enabled the network, then
> > > the
> > 
> > Interesting, not sure if this is the expected behaviour based on:
> > 
> > https://lists.clusterlabs.org/pipermail/users/2017-August/022699.html
> > 
> > Does SBD log "Majority of devices lost - surviving on pacemaker" or
> > some other messages related to Pacemaker?  
> 
> Yes.
> 
> > 
> > Also what is the status of Pacemaker when the network is down? Does it
> > report no quorum or something else?
> >   
> 
> Pacemaker on the failing node shows quorum even though it has lost 
> communication to the Quorum Device and to the other node in the cluster.

This is the main issue. Maybe inspecting the corosync-cmapctl output could shed
some light on a setting we are missing?
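
Something along these lines, run on the failing node while the network is still
cut, might be enough to collect it. This is only a rough sketch: it assumes the
usual corosync command line tools are installed, and the helper names `show`
and `dump_quorum_state` are made up for illustration.

```shell
# Sketch: dump quorum-related state from corosync on the local node.
# `show` and `dump_quorum_state` are hypothetical helper names.

show() {
    # $1 is the tool to run, the remaining arguments are passed to it.
    # A missing tool is reported and skipped instead of aborting.
    if command -v "$1" >/dev/null 2>&1; then
        echo "### $*"
        "$@" 2>&1 || true
    else
        echo "### $1: not installed, skipped"
    fi
}

dump_quorum_state() {
    # Keep only the header line and the cmap keys mentioning quorum
    # (quorum.* and runtime.votequorum.* among others).
    show corosync-cmapctl | grep -E '^###|quorum' || true
    # Quorum status as seen by votequorum on this node.
    show corosync-quorumtool -s
    # Status of the local qdevice daemon, if a quorum device is configured.
    show corosync-qdevice-tool -s
}

dump_quorum_state
```

Comparing this output with the same dump taken on the surviving node should
show whether both sides agree on the expected votes and the quorum device.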

> The non-failing node of the cluster can see the Quorum Device system and 
> thus correctly determines to fence the failing node and take over its 
> resources.

Normal.

> Only after I run firewall-cmd --panic-off, will the failing node start to log
> messages about loss of TOTEM and getting a new consensus with the 
> now visible members.
> 
> I think all of that explains the lack of self-fencing when the sbd setting of
> SBD_PACEMAKER=yes is used.

I'm not sure. If I understand correctly, SBD_PACEMAKER=yes only instructs sbd to
keep an eye on the pacemaker+corosync processes (as described upthread). It
doesn't explain why Pacemaker keeps holding the quorum, but I might be missing
something...

