[ClusterLabs] What triggers fencing?

Tue Jul 10 23:48:11 EDT 2018

11.07.2018 05:45, Confidential Company wrote:
> Not true, the faster node will kill the slower node first. It is
> possible that through misconfiguration, both could die, but it's rare
> and easily avoided with a 'delay="15"' set on the fence config for the
> node you want to win.
> 
> Don't use a delay on the other node, just the node you want to live in
> such a case.
> 
> **
>                 1. Given an Active/Passive setup, resources are active on Node1
>                 2. fence1 (for Node1, delay=15) and fence2 (for Node2,
> delay=30)
>                 3. Node2 goes down
>                 4. Node1 thinks Node2 went down / Node2 thinks Node1 went
> down

If node2 is down, it cannot think anything.

>                 5. fence1 counts 15 seconds before it fences Node1, while
> fence2 counts 30 seconds before it fences Node2
>                 6. Since fence1 has a shorter delay than fence2, fence1
> executes and shuts down Node1.
>                 7. fence1 (action: shut down Node1) will always trigger first
> because it has a shorter delay than fence2.
> 
> ** Okay, what's important is that the delays are different. But in the case
> above, even though it is Node2 that goes down, Node1 has the shorter delay,
> so Node1 gets fenced/shut down. This is a sample scenario. I don't get the
> point. Can you comment on this?
> 
> Thanks
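
To make the two cases concrete, here is a minimal Python sketch. It
assumes that the delay set on a fence device postpones the fencing of
the node that device targets, and that a fence action takes effect
instantly. The function fence_outcome and the names node1/node2 are
made up for illustration; this is a toy model, not how pacemaker
decides anything internally.

```python
def fence_outcome(delays, alive):
    """Toy model of a two-node cluster after the nodes lose contact.

    delays[n] is the delay on the fence device that targets node n,
    i.e. how long the peer waits before powering n off.
    alive is the set of nodes that are actually still running.
    Returns (nodes still running, list of (time, attacker, target)).
    """
    nodes = sorted(delays)
    pending = []
    for attacker in nodes:
        if attacker not in alive:
            continue  # a node that is down cannot fence anyone
        for target in nodes:
            if target != attacker:
                # The attacker waits for the delay set on the target's device.
                pending.append((delays[target], attacker, target))
    pending.sort()

    running = set(alive)
    fenced = []
    for when, attacker, target in pending:
        if attacker in running:  # already-fenced nodes never get to fire
            running.discard(target)
            fenced.append((when, attacker, target))
    return running, fenced


if __name__ == "__main__":
    delays = {"node1": 15, "node2": 30}
    print(fence_outcome(delays, {"node1"}))           # node2 really is down
    print(fence_outcome(delays, {"node1", "node2"}))  # both up, link lost
```

With your delays, the first call leaves node1 running and has it fence
node2 after 30 seconds. Only in the second call, where both nodes are
alive and merely cannot see each other, does node2 win the race and
fence node1 after 15 seconds.
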
> 
> On Tue, Jul 10, 2018 at 12:18 AM, Klaus Wenninger <kwenning at redhat.com>
> wrote:
> 
>> On 07/09/2018 05:53 PM, Digimer wrote:
>>> On 2018-07-09 11:45 AM, Klaus Wenninger wrote:
>>>> On 07/09/2018 05:33 PM, Digimer wrote:
>>>>> On 2018-07-09 09:56 AM, Klaus Wenninger wrote:
>>>>>> On 07/09/2018 03:49 PM, Digimer wrote:
>>>>>>> On 2018-07-09 08:31 AM, Klaus Wenninger wrote:
>>>>>>>> On 07/09/2018 02:04 PM, Confidential Company wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Any ideas what triggers fencing script or stonith?
>>>>>>>>>
>>>>>>>>> Given the setup below:
>>>>>>>>> 1. I have two nodes
>>>>>>>>> 2. Configured fencing on both nodes
>>>>>>>>> 3. Configured delay=15 and delay=30 on fence1(for Node1) and
>>>>>>>>> fence2(for Node2) respectively
>>>>>>>>>
>>>>>>>>> *What does it mean to configure a delay in stonith? Wait for 15
>>>>>>>>> seconds before it fences the node?
>>>>>>>> Given that on a 2-node-cluster you don't have real quorum to make one
>>>>>>>> partial cluster fence the rest of the nodes the different delays are
>>>>>>>> meant to prevent a fencing-race.
>>>>>>>> Without different delays that would lead to both nodes fencing each
>>>>>>>> other at the same time - finally both being down.
>>>>>>> Not true, the faster node will kill the slower node first. It is
>>>>>>> possible that through misconfiguration, both could die, but it's rare
>>>>>>> and easily avoided with a 'delay="15"' set on the fence config for the
>>>>>>> node you want to win.
>>>>>> What exactly is not true? Aren't we saying the same?
>>>>>> Of course one of the delays can be 0 (most important is that
>>>>>> they are different).
>>>>> Perhaps I misunderstood your message. It seemed to me that the
>>>>> implication was that fencing in 2-node without a delay always ends up
>>>>> with both nodes being down, which isn't the case. It can happen if the
>>>>> fence methods are not set up right (ie: the node isn't set to
>>>>> immediately power off on ACPI power button event).
>>>> Yes, a misunderstanding I guess.
>>>>
>>>> Should have been more verbose in saying that due to the
>>>> time between the fencing-command fired off to the fencing
>>>> device and the actual fencing taking place (as you state
>>>> dependent on how it is configured in detail - but a measurable
>>>> time in all cases) there is a certain probability that when
>>>> both nodes start fencing at roughly the same time we will
>>>> end up with 2 nodes down.
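
The window described here can be put into numbers with a small Python
sketch. It assumes that a fence command, once sent, cannot be recalled,
and that each node goes down a fixed latency after the command against
it was sent. The function race_result and its parameters are
hypothetical names for illustration only.

```python
def race_result(delay_a, delay_b, latency_a, latency_b):
    """Nodes A and B lose contact at t=0 and each tries to fence the other.

    delay_x   : delay configured on the device that fences node x
    latency_x : seconds between the command to fence x being sent and
                x actually being powered off
    """
    a_sends = delay_b  # A waits the delay set on B's fence device
    b_sends = delay_a  # B waits the delay set on A's fence device
    a_down_at = b_sends + latency_a  # valid if B's command goes out
    b_down_at = a_sends + latency_b  # valid if A's command goes out

    if a_sends <= b_sends:
        # A fires first; B still fires if it has not lost power yet.
        return "both down" if b_sends < b_down_at else "A survives"
    # B fires first; A still fires if it has not lost power yet.
    return "both down" if a_sends < a_down_at else "B survives"


if __name__ == "__main__":
    print(race_result(0, 0, 2, 2))   # no delays, 2 s latency
    print(race_result(1, 0, 2, 2))   # delay shorter than the latency
    print(race_result(15, 0, 2, 2))  # delay well above the latency
```

In this model a delay only protects a node if it is longer than the
time the fencing device needs to actually take the peer down, which is
why you have to know your devices.
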
>>>>
>>>> Everybody has to find his own tradeoff between how reliably
>>>> fence-races are prevented and the fencing delay I guess.
>>> We've used this;
>>>
>>> 1. IPMI (with the guest OS set to immediately power off) as primary,
>>> with a 15 second delay on the active node.
>>>
>>> 2. Two Switched PDUs (two power circuits, two PSUs) as backup fencing
>>> for when IPMI fails, with no delay.
>>>
>>> In ~8 years, across dozens and dozens of clusters and countless fence
>>> actions, we've never had a dual-fence event (where both nodes go down).
>>> So it can be done safely, but as always, test test test before prod.
>>
>> No doubt that this setup is working reliably.
>> You just have to know your fencing-devices and
>> which delays they involve.
>>
>> If we are talking about SBD (with disk as otherwise
>> it doesn't work in a sensible way in 2-node-clusters)
>> for instance I would strongly advise using a delay.
>>
>> So I guess it is important to understand the basic
>> idea behind this different delay-based fence-race
>> avoidance.
>> Afterwards you can still decide why it is no issue
>> in your own setup.
>>
>>>
>>>>> If the delay is set on both nodes, and they are different, it will work
>>>>> fine. The reason not to do this is that if one value is 0, you may as
>>>>> well not set it at all (0 is the default), and any other value causes
>>>>> avoidable fence delays.
>>>>>
>>>>>>> Don't use a delay on the other node, just the node you want to live in
>>>>>>> such a case.
>>>>>>>
>>>>>>>>> *Given Node1 is active and Node2 goes down, does it mean fence1 will
>>>>>>>>> execute first and shut down Node1 even though it is Node2 that goes
>>>>>>>>> down?
>>>>>>>> If Node2 managed to sign off properly it will not.
>>>>>>>> If network-connection is down so that Node2 can't inform Node1 that it
>>>>>>>> is going down and finally has stopped all resources it will be fenced by
>>>>>>>> Node1.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Klaus
>>>>>>> Fencing occurs in two cases;
>>>>>>>
>>>>>>> 1. The node stops responding (meaning it's in an unknown state, so it is
>>>>>>> fenced to force it into a known state).
>>>>>>> 2. A resource / service fails to stop. In this case, the service is
>>>>>>> in an unknown state, so the node is fenced to force the service into a
>>>>>>> known state so that it can be safely recovered on the peer.
>>>>>>>
>>>>>>> Graceful withdrawal of the node from the cluster, and graceful stopping
>>>>>>> of services will not lead to a fence (because in both cases, the node /
>>>>>>> service are in a known state - off).
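
The rule above boils down to "fence whenever the state is unknown". As
a minimal Python sketch (the Event names and needs_fencing are invented
for this illustration and are not pacemaker identifiers):

```python
from enum import Enum


class Event(Enum):
    NODE_UNRESPONSIVE = "node stopped responding"
    RESOURCE_STOP_FAILED = "resource failed to stop"
    NODE_LEFT_GRACEFULLY = "node withdrew from the cluster cleanly"
    RESOURCE_STOPPED_CLEANLY = "resource stopped cleanly"


# Events that leave the node or the service in an unknown state.
UNKNOWN_STATE = {Event.NODE_UNRESPONSIVE, Event.RESOURCE_STOP_FAILED}


def needs_fencing(event):
    """Fence only when the state is unknown; a known 'off' needs no fence."""
    return event in UNKNOWN_STATE


if __name__ == "__main__":
    for event in Event:
        print(f"{event.value}: fence={needs_fencing(event)}")
```
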
>>>>>>>
>>>>>
>>>
>>
>>
> 
> 
> 