[ClusterLabs] Why would a standby node be fenced? (was: How to set up fencing/stonith)

Thu May 31 16:43:11 EDT 2018

On Thu, 31 May 2018 22:52:12 +0300
Andrei Borzenkov <arvidjaar at gmail.com> wrote:

> 31.05.2018 22:18, Jehan-Guillaume de Rorthais wrote:
> > Sorry for getting back to you so late.
> > 
> > On Fri, 25 May 2018 11:58:59 -0600
> > Casey & Gina <caseyandgina at icloud.com> wrote:
> >   
> >>> On May 25, 2018, at 7:01 AM, Casey Allen Shobe <caseyandgina at icloud.com>
> >>> wrote:   
> >>>> Actually, why is Pacemaker fencing the standby node just because a
> >>>> resource fails to start there?  I thought only the master should be
> >>>> fenced if it were assumed to be broken.    
> >>
> >> This is probably the most important thing to ask outside of the PAF
> >> resource agent which many may not be as fluent with as pacemaker itself,
> >> and perhaps the most indicative of me setting something up incorrectly
> >> outside of that resource agent.
> >>
> >> My understanding of fencing was that pacemaker would only fence a node if
> >> it was the master but had stopped responding, to avoid a split-brain
> >> situation. Why would pacemaker ever fence a standby node with no resources
> >> currently allocated to it?  
> > 
> > So, as discussed on IRC and for the mailing list history, here is the
> > answer:
> > 
> > https://clusterlabs.github.io/PAF/administration.html#failover
> > 
> > In short: after a failure (either on a primary or a standby), you MUST fix
> > things on the node before starting Pacemaker.
> > 
> > If you don't, PAF will detect something incoherent and raise an error,
> > leading Pacemaker to most likely fence your node, again.
> >   
> 
> Well, that does not sound very polite to the user :)

Sure :)

But at least it's been documented, as you pointed out earlier.

After a failure and an automatic failover, either you have some automatic
failback process somewhere, or you have to fix things yourself.

PAF is not able to do automatic failback.

> Another database RA I mentioned somewhere in this thread has a different
> approach - it starts the database in its monitor action, and the start action
> is effectively a dummy.

Mh, I would have to study that. But I'm not thrilled about such behavior at
first glance.

> So start always succeeds from Pacemaker's point of
> view, but the database won't be started until manually synchronized again by
> the administrator.

It seems scary... What about the stop action? What if the monitor detects an
error? Well, I really should check the RA you are talking about to answer my
questions.
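
To make the described behavior concrete for the archive, here is a rough
Python sketch of how I read it. It is a toy model, not the code of the RA in
question: the class, its attributes and the "in_sync" flag are hypothetical
names, and having monitor report success while the database is still down is
my assumption. Only the two return codes are the standard OCF values.

```python
# Hypothetical sketch; names are illustrative, not taken from any real RA.
OCF_SUCCESS = 0       # standard OCF return code
OCF_NOT_RUNNING = 7   # standard OCF return code, listed for reference


class LazyStartAgent:
    """Toy model of an RA whose start is a no-op and whose monitor starts the DB."""

    def __init__(self):
        self.db_running = False
        # Assumption: the administrator flips this after a manual resync.
        self.in_sync = False

    def start(self):
        # Effectively a dummy: always reports success to the cluster.
        return OCF_SUCCESS

    def monitor(self):
        # The database is only started here, and only once it is in sync.
        if not self.db_running and self.in_sync:
            self.db_running = True
        # Assumption: success is reported either way, so the cluster never
        # sees a failed start on an unsynchronized node.
        return OCF_SUCCESS


agent = LazyStartAgent()
agent.start()     # cluster sees a successful start
agent.monitor()   # database still down: the node is not synchronized yet
agent.in_sync = True
agent.monitor()   # database is actually started now
```

In this model the return codes alone cannot tell "started" apart from
"waiting for the administrator".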

> The downside is that the Pacemaker resource status does not reflect the
> database status. I wish Pacemaker supported something like a "requires manual
> intervention" resource state that would not be treated like an error
> (causing all sorts of fatal consequences) but still evaluated for
> dependencies (i.e. dependent resources would not be started). That would
> be ideal for such a case.

Good idea.

I have a couple more:
* handling errors from notify actions
* supporting migrate-to/from for multistate RAs
* having real infinite master score :)

Cheers,