[ClusterLabs] Why would a standby node be fenced? (was: How to set up fencing/stonith)

Thu May 31 19:18:50 UTC 2018

Sorry for getting back to you so late.

On Fri, 25 May 2018 11:58:59 -0600
Casey & Gina <caseyandgina at icloud.com> wrote:

> > On May 25, 2018, at 7:01 AM, Casey Allen Shobe <caseyandgina at icloud.com>
> > wrote: 
> >> Actually, why is Pacemaker fencing the standby node just because a
> >> resource fails to start there?  I thought only the master should be fenced
> >> if it were assumed to be broken.  
> 
> This is probably the most important thing to ask outside of the PAF resource
> agent which many may not be as fluent with as pacemaker itself, and perhaps
> the most indicative of me setting something up incorrectly outside of that
> resource agent.
> 
> My understanding of fencing was that pacemaker would only fence a node if it
> was the master but had stopped responding, to avoid a split-brain situation.
> Why would pacemaker ever fence a standby node with no resources currently
> allocated to it?

So, as discussed on IRC and for the mailing list history, here is the answer:

https://clusterlabs.github.io/PAF/administration.html#failover

In short: after a failure (either on a primary or a standby), you MUST fix
things on the node before starting Pacemaker.

If you don't, PAF will detect an inconsistent state and raise an error, which
will most likely lead Pacemaker to fence your node again.

For instance, after a primary crash, you have to resync it as a standby of the
new primary before starting Pacemaker on the node and handing control back to
PAF. This is really important if you don't want to end up with a silently
corrupted standby in your cluster.
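
As a rough illustration of "check the node before starting Pacemaker", here is
a minimal Python sketch. It only parses text in the format printed by
PostgreSQL's pg_controldata and looks at the "Database cluster state" line.
The function names and the decision rule are assumptions made for this
example, not PAF's actual logic, and the rule is deliberately conservative: a
freshly copied base backup may still carry the primary's state and would be
flagged too. It does not replace the resync procedure from the documentation
linked above.

```python
import re

# State reported by pg_controldata for a standby that was stopped cleanly.
CLEAN_STANDBY_STATE = "shut down in recovery"


def cluster_state(controldata_output):
    """Return the 'Database cluster state' value from pg_controldata text, or None."""
    match = re.search(r"^Database cluster state:\s*(.+?)\s*$",
                      controldata_output, re.MULTILINE)
    return match.group(1) if match else None


def needs_manual_check(controldata_output):
    """Hypothetical rule: anything but a cleanly stopped standby needs a human look."""
    return cluster_state(controldata_output) != CLEAN_STANDBY_STATE


# Shortened example input: a former primary that crashed while running.
sample = "Database cluster state:               in production\n"
print(cluster_state(sample))       # in production
print(needs_manual_check(sample))  # True
```

In practice the text would come from running pg_controldata against the data
directory of the stopped instance; a True result only means "look at this node
before starting Pacemaker on it", not that the node is necessarily broken.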

Cheers,