[ClusterLabs] Antw: [EXT] Stonith failing

Ken Gaillot kgaillot at redhat.com
Tue Aug 18 10:02:33 EDT 2020


On Tue, 2020-08-18 at 08:21 +0200, Klaus Wenninger wrote:
> On 8/18/20 7:49 AM, Andrei Borzenkov wrote:
> > 17.08.2020 23:39, Jehan-Guillaume de Rorthais wrote:
> > > On Mon, 17 Aug 2020 10:19:45 -0500
> > > Ken Gaillot <kgaillot at redhat.com> wrote:
> > > 
> > > > On Fri, 2020-08-14 at 15:09 +0200, Gabriele Bulfon wrote:
> > > > > Thanks to all your suggestions, I now have the systems with
> > > > > stonith
> > > > > configured on ipmi.  
> > > > 
> > > > A word of caution: if the IPMI is on-board -- i.e. it shares
> > > > the same
> > > > power supply as the computer -- power becomes a single point of
> > > > failure. If the node loses power, the other node can't fence
> > > > because
> > > > the IPMI is also down, and the cluster can't recover.
> > > > 
> > > > Some on-board IPMI controllers can share an Ethernet port with
> > > > the main
> > > > computer, which would be a similar situation.
> > > > 
> > > > It's best to have a backup fencing method when using IPMI as
> > > > the
> > > > primary fencing method. An example would be an intelligent
> > > > power switch
> > > > or sbd.
> > > 
> > > How would SBD be useful in this scenario? The poison pill will not
> > > be swallowed by the dead node... Is it just to wait for the
> > > watchdog timeout?
> > > 
> > 
> > A node is expected to commit suicide if SBD loses access to the
> > shared block device. So either the node swallowed the poison pill
> > and died, or it died because it realized it could no longer see the
> > poison pill, or it was dead already. After the watchdog timeout
> > (twice the watchdog timeout, for safety) we assume the node is dead.
> 
> Yes, in this way a suicide via watchdog will be triggered if there are
> issues with the disk. This is why it is important to have a reliable
> watchdog with SBD even when using poison pill. As this alone would
> make a single shared disk a SPOF, when running with pacemaker
> integration (default) a node with SBD will survive despite losing the
> disk when it has quorum and pacemaker looks healthy. As corosync-quorum
> in 2-node-mode obviously won't be fit for this purpose, SBD will
> switch to checking for the presence of both nodes if the 2-node-flag
> is set.
> 
> Sorry for the lengthy explanation but the full picture is required
> to understand why it is sufficiently reliable and useful if configured
> correctly.
> 
> Klaus
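
For anyone following along, the rules Klaus and Andrei describe above can be sketched roughly as below. This is only an illustration in Python, not sbd's actual code: NodeView, survives_disk_loss and seconds_until_presumed_dead are hypothetical names made up for the example, and it covers only the case where the shared disk has already been lost (an incoming poison pill is not modelled).

```python
from dataclasses import dataclass


@dataclass
class NodeView:
    """Hypothetical snapshot of what one node knows after losing the SBD disk."""
    pacemaker_healthy: bool             # does pacemaker look healthy locally?
    quorate: bool                       # does corosync report quorum?
    two_node: bool                      # is the corosync 2-node flag set?
    nodes_seen: int                     # cluster nodes currently visible, including self
    pacemaker_integration: bool = True  # described in the thread as the default


def survives_disk_loss(view: NodeView) -> bool:
    """Return True if the node may stay up although the shared disk is gone."""
    if not view.pacemaker_integration:
        # Disk only: losing the disk means suicide via the watchdog.
        return False
    if not view.pacemaker_healthy:
        return False
    if view.two_node:
        # Quorum in 2-node mode is not fit for this purpose, so the check
        # becomes "are both nodes present?".
        return view.nodes_seen == 2
    return view.quorate


def seconds_until_presumed_dead(watchdog_timeout_s: float) -> float:
    """Survivor's wait before assuming the peer is dead: twice the watchdog timeout."""
    return 2 * watchdog_timeout_s
```

With these assumptions, a node in a 2-node cluster that loses both the disk and sight of its peer gets survives_disk_loss(...) == False and is expected to self-fence, while the surviving peer waits seconds_until_presumed_dead(watchdog_timeout) before treating it as dead.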

What I'm not sure about is how watchdog-only sbd would behave as a
fallback method for a regular fence device. Will the cluster wait for
the sbd timeout no matter what, or only if the regular fencing fails,
or ...?
-- 
Ken Gaillot <kgaillot at redhat.com>


