[ClusterLabs] What triggers fencing?

Mon Jul 9 12:18:15 EDT 2018

On 07/09/2018 05:53 PM, Digimer wrote:
> On 2018-07-09 11:45 AM, Klaus Wenninger wrote:
>> On 07/09/2018 05:33 PM, Digimer wrote:
>>> On 2018-07-09 09:56 AM, Klaus Wenninger wrote:
>>>> On 07/09/2018 03:49 PM, Digimer wrote:
>>>>> On 2018-07-09 08:31 AM, Klaus Wenninger wrote:
>>>>>> On 07/09/2018 02:04 PM, Confidential Company wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Any ideas what triggers fencing script or stonith?
>>>>>>>
>>>>>>> Given the setup below:
>>>>>>> 1. I have two nodes
>>>>>>> 2. Configured fencing on both nodes
>>>>>>> 3. Configured delay=15 and delay=30 on fence1(for Node1) and
>>>>>>> fence2(for Node2) respectively
>>>>>>>
>>>>>>> *What does it mean to configure a delay in stonith? Does it wait for
>>>>>>> 15 seconds before it fences the node?
>>>>>> Given that in a 2-node cluster you don't have real quorum to make one
>>>>>> partial cluster fence the rest of the nodes, the different delays are
>>>>>> meant to prevent a fencing race.
>>>>>> Without different delays, that would lead to both nodes fencing each
>>>>>> other at the same time - with both finally being down.
>>>>> Not true, the faster node will kill the slower node first. It is
>>>>> possible that through misconfiguration, both could die, but it's rare
>>>>> and easily avoided with a 'delay="15"' set on the fence config for the
>>>>> node you want to win.
>>>> What exactly is not true? Aren't we saying the same?
>>>> Of course one of the delays can be 0 (most important is that
>>>> they are different).
>>> Perhaps I misunderstood your message. It seemed to me that the
>>> implication was that fencing in 2-node without a delay always ends up
>>> with both nodes being down, which isn't the case. It can happen if the
>>> fence methods are not set up right (ie: the node isn't set to immediately
>>> power off on ACPI power button event).
>> Yes, a misunderstanding I guess.
>>
>> Should have been more verbose in saying that due to the
>> time between the fencing-command fired off to the fencing
>> device and the actual fencing taking place (as you state
>> dependent on how it is configured in detail - but a measurable
>> time in all cases) there is a certain probability that when
>> both nodes start fencing at roughly the same time we will
>> end up with 2 nodes down.
>>
>> Everybody has to find their own tradeoff between how reliably
>> fence-races are prevented and the fencing delay, I guess.
> We've used this;
>
> 1. IPMI (with the guest OS set to immediately power off) as primary,
> with a 15 second delay on the active node.
>
> 2. Two Switched PDUs (two power circuits, two PSUs) as backup fencing
> for when IPMI fails, with no delay.
>
> In ~8 years, across dozens and dozens of clusters and countless fence
> actions, we've never had a dual-fence event (where both nodes go down).
> So it can be done safely, but as always, test test test before prod.

No doubt that this setup is working reliably.
You just have to know your fencing devices and
which delays they involve.

If we are talking about SBD (with disk, as otherwise
it doesn't work in a sensible way in 2-node clusters),
for instance, I would strongly advise using a delay.

So I guess it is important to understand the basic
idea behind this fence-race avoidance based on
different delays.
Afterwards you can still decide why it is not an issue
in your own setup.
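
To make the idea concrete, here is a rough toy model of the race, not
how Pacemaker implements it. The function name, the node names and the
single `latency` value (time between a fence request being sent and the
target actually losing power) are assumptions made up for illustration.
As in the setup discussed in this thread, the delay is attached to the
fence device that targets a node, so the node with the larger delay is
the one that survives.

```python
def fence_race(delay_node1, delay_node2, latency=2.0):
    """Return the set of nodes that end up powered off.

    delay_node1 / delay_node2: delay (seconds) configured on the fence
    device that targets node1 / node2; a larger value protects that node.
    latency: assumed time between a fence request being sent and the
    target actually losing power (hypothetical, same for both devices).
    """
    # Each entry: (time the request is sent, node that it will kill).
    first, second = sorted([(delay_node1, "node1"), (delay_node2, "node2")])
    # The earlier request is always sent, because both nodes are still up.
    down = {first[1]}
    first_lands = first[0] + latency
    # The later request is sent by the node targeted by the first one,
    # so it only goes out if that node is still alive at that moment.
    if second[0] < first_lands:
        down.add(second[1])
    return down


print(sorted(fence_race(0, 0)))    # ['node1', 'node2'] - both down
print(sorted(fence_race(15, 30)))  # ['node1'] - node2 survives
print(sorted(fence_race(15, 0)))   # ['node2'] - node1 survives
```

In this model a dual fence only happens when the difference between the
two delays is smaller than the latency of the fencing device, which is
why you have to know your devices.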

>
>>> If the delay is set on both nodes, and they are different, it will work
>>> fine. The reason not to do this is that if you use 0, you may as well
>>> not set anything at all (0 is the default), and any other value causes
>>> avoidable fence delays.
>>>
>>>>> Don't use a delay on the other node, just the node you want to live in
>>>>> such a case.
>>>>>
>>>>>>> *Given Node1 is active and Node2 goes down, does it mean fence1 will
>>>>>>> execute first and shut down Node1 even though it is Node2 that went down?
>>>>>> If Node2 managed to sign off properly it will not.
>>>>>> If the network connection is down so that Node2 can't inform Node1
>>>>>> that it is going down and has finally stopped all resources, it will
>>>>>> be fenced by Node1.
>>>>>>
>>>>>> Regards,
>>>>>> Klaus
>>>>> Fencing occurs in two cases;
>>>>>
>>>>> 1. The node stops responding (meaning it's in an unknown state, so it is
>>>>> fenced to force it into a known state).
>>>>> 2. A resource / service fails to stop. In this case, the service is
>>>>> in an unknown state, so the node is fenced to force the service into a
>>>>> known state so that it can be safely recovered on the peer.
>>>>>
>>>>> Graceful withdrawal of the node from the cluster, and graceful stopping
>>>>> of services will not lead to a fence (because in both cases, the node /
>>>>> service are in a known state - off).
>>>>>
>>>   
>
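
The two fencing triggers quoted above boil down to "fence whenever a
node or a service is in an unknown state". A minimal sketch of that
rule, with event names invented purely for illustration (they are not
Pacemaker identifiers):

```python
from enum import Enum, auto


class Event(Enum):
    # Hypothetical event names, only meant to mirror the cases in the thread.
    NODE_LEFT_GRACEFULLY = auto()   # node signed off properly: known state
    NODE_UNRESPONSIVE = auto()      # node stopped responding: unknown state
    RESOURCE_STOPPED = auto()       # service stopped cleanly: known state
    RESOURCE_STOP_FAILED = auto()   # service failed to stop: unknown state


# Only the events that leave the node or service in an unknown state.
FENCE_TRIGGERS = {Event.NODE_UNRESPONSIVE, Event.RESOURCE_STOP_FAILED}


def needs_fencing(event):
    """Return True if the event leaves the cluster in an unknown state."""
    return event in FENCE_TRIGGERS


print(needs_fencing(Event.NODE_UNRESPONSIVE))     # True
print(needs_fencing(Event.NODE_LEFT_GRACEFULLY))  # False
```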