[Pacemaker] stonith/SBD question in the event of a lost node

mark - pacemaker list m+pacemaker at nerdish.us
Mon Apr 2 10:33:22 EDT 2012


I'm just looking to verify that I'm understanding/configuring SBD
correctly.  It works great in the controlled cases where you unplug a node
from the network (it gets fenced via SBD) or remove its access to the
shared disk (the node suicides).  However, in the event of a hardware
failure or power interruption that takes a node offline before SBD can
fence it, if that node never comes back into the cluster then its resources
can never start anywhere else.  The surviving nodes keep trying to fence
the dead node at regular intervals but never succeed.

I understand why this is the case: without a successful fence operation the
remaining nodes have no way of knowing whether it's safe to start those
resources.  Still, am I missing some option or setting that would allow
for a safe auto-recovery, or is it a caveat of SBD that if a node leaves
suddenly and uncleanly, its resources stay down until you do some heavy
manual intervention?  I suppose this may be one of the reasons that fencing
via power devices is generally considered the best way to go about it?
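
To make sure I'm describing the behaviour correctly, here is a toy model of
what I'm seeing.  It's plain Python, not Pacemaker or SBD code; NodeState,
retry_fencing and may_recover_resources are names I made up for
illustration, and the always-failing fence callback just stands in for the
fence attempts that never succeed in my setup.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class NodeState:
    name: str
    reachable: bool        # does the cluster still see this node?
    fence_confirmed: bool  # has a fence operation been confirmed for it?


def retry_fencing(node: NodeState, attempts: int,
                  fence_once: Callable[[NodeState], bool]) -> Optional[int]:
    """Try to fence the node up to `attempts` times.

    Returns the 1-based attempt number that succeeded, or None if all failed.
    """
    for attempt in range(1, attempts + 1):
        if fence_once(node):
            node.fence_confirmed = True
            return attempt
    return None


def may_recover_resources(node: NodeState) -> bool:
    """A lost node's resources may start elsewhere only once it is fenced."""
    if node.reachable:
        return False  # node is still a member, nothing to recover
    return node.fence_confirmed


# Hypothetical scenario: node2 lost power and every fence attempt fails.
dead = NodeState(name="node2", reachable=False, fence_confirmed=False)
result = retry_fencing(dead, attempts=3, fence_once=lambda n: False)
recoverable = may_recover_resources(dead)
```

In this model, recoverable stays False no matter how many attempts are
made, which matches what I see: the resources stay stopped until something
confirms the fence.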

