[Pacemaker] stonith/SBD question in the event of a lost node

Lars Marowsky-Bree lmb at suse.com
Mon Apr 2 11:35:25 EDT 2012

On 2012-04-02T09:33:22, mark - pacemaker list <m+pacemaker at nerdish.us> wrote:

> Hello,
> I'm just looking to verify that I'm understanding/configuring SBD
> correctly.  It works great in the controlled cases where you unplug a node
> from the network (it gets fenced via SBD) or remove its access to the
> shared disk (the node suicides).  However, in the event of a hardware
> failure or power interruption that takes a node offline before SBD can
> fence it, if that node never comes back into the cluster then its resources
> can't ever start anywhere else.  The surviving nodes will continue to try
> to fence the dead node at regular intervals but can never succeed.

No, that is not correct.

The node will be fenced implicitly - the poison pill is still written,
and SBD knows that the node will have either read it (and committed
suicide), determined that it was unable to read it (and committed
suicide), or the watchdog will have triggered if SBD itself has failed
beyond hope (i.e., the node will have committed suicide). Hence, after
the msgwait timeout, the node will be declared "successfully dead" after
the poison pill was written.
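
As a rough illustration of that reasoning (a sketch only, not sbd's
actual code; the function and parameter names are hypothetical), the
outcome of a fencing request depends only on whether the pill was
written and whether msgwait has elapsed since then:

```python
def fence_result(pill_written: bool, seconds_since_write: float, msgwait: float) -> str:
    """Illustrative model of implicit fencing; names are hypothetical, not sbd's API."""
    if not pill_written:
        # The poison pill never reached the device(s), so nothing can be assumed.
        return "failed"
    if seconds_since_write < msgwait:
        # The target may still read the pill, notice it cannot read it,
        # or be reset by its watchdog within this window.
        return "pending"
    # msgwait has elapsed: the target is assumed to be dead,
    # whether or not it was still running when the pill was written.
    return "fenced"


print(fence_result(True, 5, 10))
print(fence_result(True, 10, 10))
```

Note that the state of the target node is not an input at all, which
is why a node that lost power beforehand is still fenced successfully.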

What can affect fencing is the inability to write the poison pill to the
(majority of) sbd device(s); e.g., the connection between the surviving
nodes and the (majority of) sbd device(s) is broken.
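
A minimal sketch of that condition, assuming "majority" means a strict
majority of the configured devices (the helper name and the exact rule
are assumptions for illustration, not taken from the sbd source):

```python
def can_write_pill(writable_devices: int, total_devices: int) -> bool:
    """Assumed rule: the pill must reach a strict majority of the sbd devices."""
    if total_devices < 1:
        raise ValueError("at least one sbd device is required")
    if not 0 <= writable_devices <= total_devices:
        raise ValueError("writable_devices must be between 0 and total_devices")
    return 2 * writable_devices > total_devices


print(can_write_pill(1, 1))  # single device, reachable
print(can_write_pill(1, 3))  # only one of three devices reachable
print(can_write_pill(2, 3))  # two of three devices reachable
```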

Or, theoretically, if the node has never been up and claimed its slot on
the sbd device(s); but that is indeed reasonably unlikely.

So the resources will be claimed afterwards; of course, the
stonith-timeout needs to be higher than msgwait for this to work.
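
That relation can be checked with a trivial sketch; the helper name and
the example values (in seconds) are hypothetical, not recommended
settings:

```python
def timeouts_consistent(stonith_timeout_s: float, msgwait_s: float) -> bool:
    """True if the cluster waits longer than msgwait before giving up on fencing."""
    return stonith_timeout_s > msgwait_s


# Hypothetical example values, in seconds.
print(timeouts_consistent(stonith_timeout_s=40, msgwait_s=10))
print(timeouts_consistent(stonith_timeout_s=10, msgwait_s=10))
```

If stonith-timeout is not higher than msgwait, the fencing operation
times out before the node can be declared dead.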

Are you actually seeing the behaviour you describe (in which case it is
either a bug or something else going wrong), or is this speculation?

> manual intervention?  I suppose this may be one of the reasons that fencing
> via power devices is pretty much the best way to go about it?

No, fencing via power devices exposes one to the madness that is
management board firmware. If I have the choice, I'll always pick SBD.


Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
