[ClusterLabs] What triggers fencing?

Klaus Wenninger kwenning at redhat.com
Wed Jul 11 05:06:56 EDT 2018


On 07/11/2018 05:48 AM, Andrei Borzenkov wrote:
> 11.07.2018 05:45, Confidential Company wrote:
>> Not true, the faster node will kill the slower node first. It is
>> possible that through misconfiguration, both could die, but it's rare
>> and easily avoided with a 'delay="15"' set on the fence config for the
>> node you want to win.
>>
>> Don't use a delay on the other node, just the node you want to live in
>> such a case.
>>
>> **
>>                 1. Given Active/Passive setup, resources are active on Node1
>>                 2. fence1(prefers to Node1, delay=15) and fence2(prefers to
>> Node2, delay=30)
>>                 3. Node2 goes down
>>                 4. Node1 thinks Node2 goes down / Node2 thinks Node1 goes
>> down
> If node2 is down, it cannot think anything.

True. For my answer below I am assuming it is not really down
but just somehow disconnected.

>
>>                 5. fence1 counts 15 seconds before it fences Node1, while
>> fence2 counts 30 seconds before it fences Node2
>>                 6. Since fence1 has a shorter delay than fence2, fence1
>> executes and shuts down Node1.
>>                 7. fence1 (action: shut down Node1) will always trigger
>> first because it has a shorter delay than fence2.
>>
>> ** Okay, what's important is that they should be different. But in the case
>> above, even though it is Node2 that goes down, Node1 has the shorter delay,
>> so Node1 gets fenced/shut down. This is a sample scenario. I don't get the
>> point. Can you comment on this?

You didn't send the actual config, but from your description
I understand the scenario this way:

fencing-resource fence1 is running on Node2; it is there
to fence Node1 and has a delay of 15s.
fencing-resource fence2 is running on Node1; it is there
to fence Node2 and has a delay of 30s.
If they now begin to fence each other at the same time, the
node actually fenced would of course be Node1, as the
fencing-resource fence1 is going to shoot 15s earlier than
fence2.
Looks consistent to me ...
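
To make the timing concrete, here is a minimal Python sketch of that
scenario. It is only a toy model and not how Pacemaker actually schedules
fencing: the Fence class, the fenced_nodes function and the 2s latency
between firing the fence command and the actual power-off are assumptions
made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Fence:
    name: str
    target: str           # node this fencing-resource shoots
    runs_on: str          # node that executes the fence action
    delay: float          # configured delay in seconds
    latency: float = 2.0  # assumed time from command to power-off


def fenced_nodes(fences):
    """Return the set of nodes that end up powered off (toy model)."""
    death_time = {}
    # Walk the fence actions in the order their commands would be fired.
    for f in sorted(fences, key=lambda f: f.delay):
        # A node that is already dead when its delay expires cannot fire.
        if death_time.get(f.runs_on, float("inf")) <= f.delay:
            continue
        hit = f.delay + f.latency
        death_time[f.target] = min(hit, death_time.get(f.target, float("inf")))
    return set(death_time)


# Scenario from the description: fence1 (delay 15) runs on Node2,
# fence2 (delay 30) runs on Node1.
described = [
    Fence("fence1", target="Node1", runs_on="Node2", delay=15),
    Fence("fence2", target="Node2", runs_on="Node1", delay=30),
]
print(sorted(fenced_nodes(described)))  # ['Node1']

# Same devices without different delays: both commands are already
# fired before either node is actually powered off.
no_delay = [
    Fence("fence1", target="Node1", runs_on="Node2", delay=0),
    Fence("fence2", target="Node2", runs_on="Node1", delay=0),
]
print(sorted(fenced_nodes(no_delay)))  # ['Node1', 'Node2']
```

In this model the node targeted by the shorter delay always loses, which
is why the delay belongs on the fencing-resource of the node you want to
survive.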

Regards,
Klaus

>>
>> Thanks
>>
>> On Tue, Jul 10, 2018 at 12:18 AM, Klaus Wenninger <kwenning at redhat.com>
>> wrote:
>>
>>> On 07/09/2018 05:53 PM, Digimer wrote:
>>>> On 2018-07-09 11:45 AM, Klaus Wenninger wrote:
>>>>> On 07/09/2018 05:33 PM, Digimer wrote:
>>>>>> On 2018-07-09 09:56 AM, Klaus Wenninger wrote:
>>>>>>> On 07/09/2018 03:49 PM, Digimer wrote:
>>>>>>>> On 2018-07-09 08:31 AM, Klaus Wenninger wrote:
>>>>>>>>> On 07/09/2018 02:04 PM, Confidential Company wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Any ideas what triggers fencing script or stonith?
>>>>>>>>>>
>>>>>>>>>> Given the setup below:
>>>>>>>>>> 1. I have two nodes
>>>>>>>>>> 2. Configured fencing on both nodes
>>>>>>>>>> 3. Configured delay=15 and delay=30 on fence1(for Node1) and
>>>>>>>>>> fence2(for Node2) respectively
>>>>>>>>>>
>>>>>>>>>> *What does it mean to configure a delay in stonith? Wait for 15
>>>>>>>>>> seconds before it fences the node?
>>>>>>>>> Given that on a 2-node-cluster you don't have real quorum to make
>>>>>>>>> one partial cluster fence the rest of the nodes the different
>>>>>>>>> delays are meant to prevent a fencing-race.
>>>>>>>>> Without different delays that would lead to both nodes fencing each
>>>>>>>>> other at the same time - finally both being down.
>>>>>>>> Not true, the faster node will kill the slower node first. It is
>>>>>>>> possible that through misconfiguration, both could die, but it's rare
>>>>>>>> and easily avoided with a 'delay="15"' set on the fence config for
>>>>>>>> the node you want to win.
>>>>>>> What exactly is not true? Aren't we saying the same?
>>>>>>> Of course one of the delays can be 0 (most important is that
>>>>>>> they are different).
>>>>>> Perhaps I misunderstood your message. It seemed to me that the
>>>>>> implication was that fencing in 2-node without a delay always ends up
>>>>>> with both nodes being down, which isn't the case. It can happen if the
>>>>>> fence methods are not set up right (ie: the node isn't set to
>>>>>> immediately power off on ACPI power button event).
>>>>> Yes, a misunderstanding I guess.
>>>>>
>>>>> Should have been more verbose in saying that due to the
>>>>> time between the fencing-command fired off to the fencing
>>>>> device and the actual fencing taking place (as you state
>>>>> dependent on how it is configured in detail - but a measurable
>>>>> time in all cases) there is a certain probability that when
>>>>> both nodes start fencing at roughly the same time we will
>>>>> end up with 2 nodes down.
>>>>>
>>>>> Everybody has to find his own tradeoff between how reliably
>>>>> fence-races are prevented and the fencing delay, I guess.
>>>> We've used this;
>>>>
>>>> 1. IPMI (with the guest OS set to immediately power off) as primary,
>>>> with a 15 second delay on the active node.
>>>>
>>>> 2. Two Switched PDUs (two power circuits, two PSUs) as backup fencing
>>>> for when IPMI fails, with no delay.
>>>>
>>>> In ~8 years, across dozens and dozens of clusters and countless fence
>>>> actions, we've never had a dual-fence event (where both nodes go down).
>>>> So it can be done safely, but as always, test test test before prod.
>>> No doubt that this setup is working reliably.
>>> You just have to know your fencing-devices and
>>> which delays they involve.
>>>
>>> If we are talking about SBD (with disk as otherwise
>>> it doesn't work in a sensible way in 2-node-clusters)
>>> for instance I would strongly advise using a delay.
>>>
>>> So I guess it is important to understand the basic
>>> idea behind this different delay-based fence-race
>>> avoidance.
>>> Afterwards you can still decide why it is no issue
>>> in your own setup.
>>>
>>>>>> If the delay is set on both nodes, and they are different, it will work
>>>>>> fine. The reason not to do this is that if you use 0, then don't use
>>>>>> anything at all (0 is default), and any other value causes avoidable
>>>>>> fence delays.
>>>>>>
>>>>>>>> Don't use a delay on the other node, just the node you want to live
>>>>>>>> in such a case.
>>>>>>>>
>>>>>>>>>> *Given Node1 is active and Node2 goes down, does it mean fence1
>>>>>>>>>> will first execute and shut down Node1 even though Node2 goes down?
>>>>>>>>> If Node2 managed to sign off properly it will not.
>>>>>>>>> If the network connection is down so that Node2 can't inform Node1
>>>>>>>>> that it is going down and has finally stopped all resources, it
>>>>>>>>> will be fenced by Node1.
>>>>>>>>> Regards,
>>>>>>>>> Klaus
>>>>>>>> Fencing occurs in two cases;
>>>>>>>>
>>>>>>>> 1. The node stops responding (meaning it's in an unknown state, so
>>>>>>>> it is fenced to force it into a known state).
>>>>>>>> 2. A resource / service fails to stop. In this case, the service is
>>>>>>>> in an unknown state, so the node is fenced to force the service into
>>>>>>>> a known state so that it can be safely recovered on the peer.
>>>>>>>>
>>>>>>>> Graceful withdrawal of the node from the cluster, and graceful
>>>>>>>> stopping of services will not lead to a fence (because in both cases,
>>>>>>>> the node / service are in a known state - off).
>>>>>>>>
>>>
>>
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>