[ClusterLabs] Why would a standby node be fenced? (was: How to set up fencing/stonith)
Jehan-Guillaume de Rorthais
jgdr at dalibo.com
Thu May 31 16:43:11 EDT 2018
On Thu, 31 May 2018 22:52:12 +0300
Andrei Borzenkov <arvidjaar at gmail.com> wrote:
> 31.05.2018 22:18, Jehan-Guillaume de Rorthais wrote:
> > Sorry for getting back to you so late.
> >
> > On Fri, 25 May 2018 11:58:59 -0600
> > Casey & Gina <caseyandgina at icloud.com> wrote:
> >
> >>> On May 25, 2018, at 7:01 AM, Casey Allen Shobe <caseyandgina at icloud.com>
> >>> wrote:
> >>>> Actually, why is Pacemaker fencing the standby node just because a
> >>>> resource fails to start there? I thought only the master should be
> >>>> fenced if it were assumed to be broken.
> >>
> >> This is probably the most important thing to ask outside of the PAF
> >> resource agent which many may not be as fluent with as pacemaker itself,
> >> and perhaps the most indicative of me setting something up incorrectly
> >> outside of that resource agent.
> >>
> >> My understanding of fencing was that pacemaker would only fence a node if
> >> it was the master but had stopped responding, to avoid a split-brain
> >> situation. Why would pacemaker ever fence a standby node with no resources
> >> currently allocated to it?
> >
> > So, as discussed on IRC and for the mailing list history, here is the
> > answer:
> >
> > https://clusterlabs.github.io/PAF/administration.html#failover
> >
> > In short: after a failure (either on a primary or a standby), you MUST fix
> > things on the node before starting Pacemaker.
> >
> > If you don't, PAF will detect something incoherent and raise an error,
> > leading Pacemaker to most likely fence your node, again.
> >
>
> Well, that does not sound very polite to the user :)
Sure :)
But at least it's documented, as you pointed out earlier.
After a failure and an automatic failover, either you have some automatic
failback process somewhere...or you have to fix things yourself.
PAF is not able to do automatic failback.
> Another database RA I mentioned somewhere in this thread has a different
> approach - it starts the database in its monitor action, and the start
> action is effectively a dummy.
Hmm, I would have to study that. But I'm not thrilled about such behavior at
first glance.
> So start always succeeds from Pacemaker's point of
> view, but the database won't be started until manually synchronized again by
> the administrator.
It seems scary... What about the stop action? What if the monitor detects an
error? Well, I really should check the RA you are talking about to answer my
questions.
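
For the list archive, the approach described in the quoted text above can be
sketched roughly as follows. This is only a minimal illustration in Python,
not the code of the RA being discussed (which is not named here): the
FakeDatabase and DeferredStartAgent names, the in_sync flag, and the choice to
have monitor always report success are assumptions made for the example. Only
the OCF exit code values are standard.

```python
from dataclasses import dataclass

# Standard OCF resource agent exit codes.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


@dataclass
class FakeDatabase:
    """Hypothetical stand-in for a database instance (not a real API)."""
    in_sync: bool = False   # assumed flag: set once an admin has resynced the node
    running: bool = False


class DeferredStartAgent:
    """Sketch of an agent whose start is a no-op and whose monitor starts the DB."""

    def __init__(self, db: FakeDatabase) -> None:
        self.db = db

    def start(self) -> int:
        # Dummy start: always reports success, so the cluster never sees
        # a start failure on this node.
        return OCF_SUCCESS

    def monitor(self) -> int:
        # The real work happens here: the database is only started once
        # it has been manually synchronized.
        if not self.db.running and self.db.in_sync:
            self.db.running = True
        # Assumption: success is reported even while the database is down,
        # which is why the resource status does not reflect the database status.
        return OCF_SUCCESS

    def stop(self) -> int:
        # Stopping an already stopped database is treated as success.
        self.db.running = False
        return OCF_SUCCESS
```

With this model, start() and monitor() both return OCF_SUCCESS on an
unsynchronized node while db.running stays False, and the database only comes
up on the first monitor() call after in_sync has been set. It does not answer
the questions about stop or about errors detected by monitor; those depend on
the real RA.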
> The downside is that the Pacemaker resource status does not reflect the
> database status. I wish Pacemaker supported something like a "requires
> manual intervention" resource state that would not be treated like an error
> (causing all sorts of fatal consequences) but still evaluated for
> dependencies (i.e. dependent resources would not be started). That would
> be ideal for such a case.
Good idea.
I have a couple more:
* handling errors from notify actions
* supporting migrate-to/from for multistate RAs
* having real infinite master score :)
Cheers,