[ClusterLabs] Why would a standby node be fenced? (was: How to set up fencing/stonith)

Thu May 31 19:52:12 UTC 2018

31.05.2018 22:18, Jehan-Guillaume de Rorthais wrote:
> Sorry for getting back to you so late.
> 
> On Fri, 25 May 2018 11:58:59 -0600
> Casey & Gina <caseyandgina at icloud.com> wrote:
> 
>>> On May 25, 2018, at 7:01 AM, Casey Allen Shobe <caseyandgina at icloud.com>
>>> wrote: 
>>>> Actually, why is Pacemaker fencing the standby node just because a
>>>> resource fails to start there?  I thought only the master should be fenced
>>>> if it were assumed to be broken.  
>>
>> This is probably the most important thing to ask outside of the PAF resource
>> agent which many may not be as fluent with as pacemaker itself, and perhaps
>> the most indicative of me setting something up incorrectly outside of that
>> resource agent.
>>
>> My understanding of fencing was that pacemaker would only fence a node if it
>> was the master but had stopped responding, to avoid a split-brain situation.
>> Why would pacemaker ever fence a standby node with no resources currently
>> allocated to it?
> 
> So, as discussed on IRC and for the mailing list history, here is the answer:
> 
> https://clusterlabs.github.io/PAF/administration.html#failover
> 
> In short: after a failure (either on a primary or a standby), you MUST fix
> things on the node before starting Pacemaker.
> 
> If you don't, PAF will detect something incoherent and raise an error, leading
> Pacemaker to most likely fence your node, again.
> 

Well, that does not sound very polite to the user :)

Another database RA I mentioned somewhere in this thread takes a
different approach: it starts the database in its monitor action, and
its start action is effectively a dummy. So start always succeeds from
Pacemaker's point of view, but the database won't be started until the
administrator has manually synchronized it again.
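
A rough sketch of that idea, in Python rather than the shell a real
agent would normally be written in. This is not that RA's actual code:
the Database class, its in_sync flag and the function names are all
made up for illustration, and having monitor report success even while
the database is down is my assumption about how such an agent behaves.
Only the exit code value 0 (OCF_SUCCESS) comes from the OCF convention.

```python
from dataclasses import dataclass

OCF_SUCCESS = 0  # OCF exit code meaning "action succeeded"


@dataclass
class Database:
    """Hypothetical stand-in for the real database instance."""
    in_sync: bool = False   # assumed flag, set once the admin has resynced
    running: bool = False


def start(db: Database) -> int:
    # Effectively a dummy: never touches the database, always succeeds.
    return OCF_SUCCESS


def monitor(db: Database) -> int:
    # The real start happens here, and only after manual resynchronization.
    if not db.running and db.in_sync:
        db.running = True
    # Assumption: report success either way, so Pacemaker sees no failure
    # and has nothing to escalate to fencing.
    return OCF_SUCCESS
```

With this shape, a freshly failed node gets OCF_SUCCESS from both start
and monitor while db.running stays False; the first monitor call after
the administrator sets in_sync is what actually brings the database up.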

The downside is that the Pacemaker resource status does not reflect the
database status. I wish Pacemaker supported something like a "requires
manual intervention" resource state that would not be treated as an
error (causing all sorts of fatal consequences) but would still be
evaluated for dependencies (i.e. dependent resources would not be
started). That would be ideal for such a case.