[ClusterLabs] What triggers fencing?

Digimer lists at alteeve.ca
Mon Jul 9 11:53:01 EDT 2018


On 2018-07-09 11:45 AM, Klaus Wenninger wrote:
> On 07/09/2018 05:33 PM, Digimer wrote:
>> On 2018-07-09 09:56 AM, Klaus Wenninger wrote:
>>> On 07/09/2018 03:49 PM, Digimer wrote:
>>>> On 2018-07-09 08:31 AM, Klaus Wenninger wrote:
>>>>> On 07/09/2018 02:04 PM, Confidential Company wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Any ideas what triggers fencing script or stonith?
>>>>>>
>>>>>> Given the setup below:
>>>>>> 1. I have two nodes
>>>>>> 2. Configured fencing on both nodes
>>>>>> 3. Configured delay=15 and delay=30 on fence1 (for Node1) and
>>>>>> fence2 (for Node2) respectively
>>>>>>
>>>>>> *What does it mean to configure a delay in stonith? Does it wait for
>>>>>> 15 seconds before it fences the node?
>>>>> Given that on a 2-node cluster you don't have real quorum to make one
>>>>> partial cluster fence the rest of the nodes, the different delays are
>>>>> meant to prevent a fencing race.
>>>>> Without different delays that would lead to both nodes fencing each
>>>>> other at the same time - finally both being down.
>>>> Not true, the faster node will kill the slower node first. It is
>>>> possible that through misconfiguration, both could die, but it's rare
>>>> and easily avoided with a 'delay="15"' set on the fence config for the
>>>> node you want to win.
>>> What exactly is not true? Aren't we saying the same?
>>> Of course one of the delays can be 0 (most important is that
>>> they are different).
>> Perhaps I misunderstood your message. It seemed to me that the
>> implication was that fencing in 2-node without a delay always ends up
>> with both nodes being down, which isn't the case. It can happen if the
>> fence methods are not set up right (i.e. the node isn't set to immediately
>> power off on an ACPI power button event).
> Yes, a misunderstanding I guess.
> 
> Should have been more verbose in saying that due to the
> time between the fencing-command fired off to the fencing
> device and the actual fencing taking place (as you state
> dependent on how it is configured in detail - but a measurable
> time in all cases) there is a certain probability that when
> both nodes start fencing at roughly the same time we will
> end up with 2 nodes down.
> 
> Everybody has to find his own tradeoff between how reliably
> fence races are prevented and the fencing delay, I guess.

We've used this:

1. IPMI (with the guest OS set to immediately power off) as primary,
with a 15 second delay on the active node.

2. Two Switched PDUs (two power circuits, two PSUs) as backup fencing
for when IPMI fails, with no delay.

In ~8 years, across dozens and dozens of clusters and countless fence
actions, we've never had a dual-fence event (where both nodes go down).
So it can be done safely, but as always, test test test before prod.
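
To make the race concrete, here is a rough Python sketch of the reasoning in this thread. It is only a toy model, not anything Pacemaker runs: the function name `survivors`, the `latency` values and the 4 second figure are made up for illustration. It assumes both nodes lose contact at the same instant, that the delay configured for a node is how long its peer waits before shooting it, and that a fence command, once sent to the device, completes even if the sender dies in the meantime.

```python
def survivors(delay, latency):
    """Return the set of nodes still up after a two-node fence race.

    delay[n]:   seconds the peer waits before issuing the fence command against n
    latency[n]: seconds between that command being issued and n actually being off
    (both are hypothetical inputs for this sketch)
    """
    a, b = delay  # exactly two node names expected
    death = {n: delay[n] + latency[n] for n in (a, b)}
    # The node whose fence command is issued first always gets shot.
    first, second = sorted((a, b), key=lambda n: delay[n])
    dead = {first}
    # `first` is the node that shoots `second`; its command only goes out
    # if `first` is still alive at that moment.
    if delay[second] < death[first]:
        dead.add(second)
    return {a, b} - dead


# Delay of 15 protecting node1, none for node2: node1 wins.
print(survivors({"node1": 15, "node2": 0}, {"node1": 4, "node2": 4}))
# No delay anywhere and a slow power-off: both commands are out before either lands.
print(survivors({"node1": 0, "node2": 0}, {"node1": 4, "node2": 4}))
```

In this model the dual-fence window is the power-off latency, which is why both the delay and an immediate power-off on the ACPI event matter.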

>> If the delay is set on both nodes, and they are different, it will work
>> fine. The reason not to do this is that 0 is the default, so setting it
>> explicitly is the same as setting nothing at all, and any other value
>> on the second node causes avoidable fence delays.
>>
>>>> Don't use a delay on the other node, just the node you want to live in
>>>> such a case.
>>>>
>>>>>> *Given Node1 is active and Node2 goes down, does it mean fence1 will
>>>>>> execute first and shut down Node1 even though it is Node2 that went down?
>>>>> If Node2 managed to sign off properly it will not.
>>>>> If the network connection is down, so that Node2 can't inform Node1 that
>>>>> it is going down and has finally stopped all resources, it will be
>>>>> fenced by Node1.
>>>>>
>>>>> Regards,
>>>>> Klaus
>>>> Fencing occurs in two cases:
>>>>
>>>> 1. The node stops responding (meaning it's in an unknown state, so it is
>>>> fenced to force it into a known state).
>>>> 2. A resource / service fails to stop. In this case, the service is
>>>> in an unknown state, so the node is fenced to force the service into a
>>>> known state so that it can be safely recovered on the peer.
>>>>
>>>> Graceful withdrawal of the node from the cluster, and graceful stopping
>>>> of services will not lead to a fence (because in both cases, the node /
>>>> service are in a known state - off).
>>>>
>>
>>   
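
The two cases quoted above boil down to one rule: fence whenever a node or a service ends up in an unknown state. A minimal Python sketch of that rule follows. The `Event` names and `triggers_fencing` are invented for illustration and are not part of Pacemaker or any fence agent.

```python
from enum import Enum, auto


class Event(Enum):
    # Hypothetical event names, chosen to mirror the cases discussed above.
    NODE_UNRESPONSIVE = auto()         # node stopped responding
    RESOURCE_STOP_FAILED = auto()      # a service failed to stop
    NODE_LEFT_GRACEFULLY = auto()      # node withdrew from the cluster cleanly
    RESOURCE_STOPPED_CLEANLY = auto()  # service stopped as requested


# Events that leave the node or service in an unknown state.
UNKNOWN_STATE = {Event.NODE_UNRESPONSIVE, Event.RESOURCE_STOP_FAILED}


def triggers_fencing(event):
    """Fence only when the state is unknown; a clean stop is a known state (off)."""
    return event in UNKNOWN_STATE


for event in Event:
    print(event.name, triggers_fencing(event))
```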


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould


