[ClusterLabs] What triggers fencing?

Wed Jul 11 02:45:35 UTC 2018

Not true, the faster node will kill the slower node first. It is
possible that through misconfiguration, both could die, but it's rare
and easily avoided with a 'delay="15"' set on the fence config for the
node you want to win.

Don't use a delay on the other node, just the node you want to live in
such a case.

1. Given an Active/Passive setup, resources are active on Node1.
2. fence1 (prefers Node1, delay=15) and fence2 (prefers Node2, delay=30).
3. Node2 goes down.
4. Node1 thinks Node2 went down / Node2 thinks Node1 went down.
5. fence1 counts 15 seconds before it fences Node1, while fence2 counts
30 seconds before it fences Node2.
6. Since fence1 has a shorter delay than fence2, fence1 executes and
shuts down Node1.
7. fence1 (action: shut down Node1) will always trigger first because it
has a shorter delay than fence2.

Okay, so what's important is that the delays are different. But in the case
above, even though it is Node2 that goes down, Node1 has the shorter delay,
so Node1 gets fenced/shut down. This is just a sample scenario, but I don't
get the point. Can you comment on this?
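
The scenario above can be worked through with a small sketch. It is only a
toy model, not Pacemaker code, and it rests on assumptions: fence1 is the
device that powers off Node1, the delay on a device postpones the fencing
of that device's target, only a node that is still running can fence its
peer, and the 2-second latency between issuing a fence and the power-off
is an invented value. The function name and parameters are hypothetical.

```python
def simulate_fencing(alive, delay, latency=2.0):
    """Return {node: time_fenced} for a two-node cluster that lost contact.

    alive   -- set of nodes that are actually still running
    delay   -- delay[n] is the delay on the fence device that targets n
    latency -- assumed seconds between issuing a fence and the power-off
    """
    nodes = ("node1", "node2")
    peer = {"node1": "node2", "node2": "node1"}
    # A running node waits out the delay on the device targeting its peer.
    shots = sorted((delay[peer[n]], n) for n in nodes if n in alive)
    fenced = {}
    for fire_time, shooter in shots:
        # A shooter that has already been powered off never fires.
        if shooter in fenced and fenced[shooter] <= fire_time:
            continue
        fenced[peer[shooter]] = fire_time + latency
    return fenced


delays = {"node1": 15, "node2": 30}
print(simulate_fencing({"node1"}, delays))           # Node2 really is down
print(simulate_fencing({"node1", "node2"}, delays))  # network split, both up
```

Under these assumptions, a real outage of Node2 leaves only Node1 able to
act, so Node2 is fenced once its 30-second delay expires. Node1 is shut
down first only in a split where both nodes are still running. With equal
delays and a non-zero latency, the same model has both nodes going down.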

Thanks

On Tue, Jul 10, 2018 at 12:18 AM, Klaus Wenninger <kwenning at redhat.com>
wrote:

> On 07/09/2018 05:53 PM, Digimer wrote:
> > On 2018-07-09 11:45 AM, Klaus Wenninger wrote:
> >> On 07/09/2018 05:33 PM, Digimer wrote:
> >>> On 2018-07-09 09:56 AM, Klaus Wenninger wrote:
> >>>> On 07/09/2018 03:49 PM, Digimer wrote:
> >>>>> On 2018-07-09 08:31 AM, Klaus Wenninger wrote:
> >>>>>> On 07/09/2018 02:04 PM, Confidential Company wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Any ideas what triggers fencing script or stonith?
> >>>>>>>
> >>>>>>> Given the setup below:
> >>>>>>> 1. I have two nodes
> >>>>>>> 2. Configured fencing on both nodes
> >>>>>>> 3. Configured delay=15 and delay=30 on fence1(for Node1) and
> >>>>>>> fence2(for Node2) respectively
> >>>>>>>
> >>>>>>> *What does it mean to configure a delay in stonith? Wait for 15
> >>>>>>> seconds before it fences the node?
> >>>>>> Given that on a 2-node-cluster you don't have real quorum to make
> >>>>>> one partial cluster fence the rest of the nodes, the different
> >>>>>> delays are meant to prevent a fencing-race.
> >>>>>> Without different delays that would lead to both nodes fencing each
> >>>>>> other at the same time - finally both being down.
> >>>>> Not true, the faster node will kill the slower node first. It is
> >>>>> possible that through misconfiguration, both could die, but it's rare
> >>>>> and easily avoided with a 'delay="15"' set on the fence config for
> >>>>> the node you want to win.
> >>>> What exactly is not true? Aren't we saying the same?
> >>>> Of course one of the delays can be 0 (most important is that
> >>>> they are different).
> >>> Perhaps I misunderstood your message. It seemed to me that the
> >>> implication was that fencing in 2-node without a delay always ends up
> >>> with both nodes being down, which isn't the case. It can happen if the
> >>> fence methods are not set up right (ie: the node isn't set to
> >>> immediately power off on ACPI power button event).
> >> Yes, a misunderstanding I guess.
> >>
> >> Should have been more verbose in saying that due to the
> >> time between the fencing-command fired off to the fencing
> >> device and the actual fencing taking place (as you state
> >> dependent on how it is configured in detail - but a measurable
> >> time in all cases) there is a certain probability that when
> >> both nodes start fencing at roughly the same time we will
> >> end up with 2 nodes down.
> >>
> >> Everybody has to find his own tradeoff between how reliably
> >> fence-races are prevented and the fencing delay, I guess.
> > We've used this;
> >
> > 1. IPMI (with the guest OS set to immediately power off) as primary,
> > with a 15 second delay on the active node.
> >
> > 2. Two Switched PDUs (two power circuits, two PSUs) as backup fencing
> > for when IPMI fails, with no delay.
> >
> > In ~8 years, across dozens and dozens of clusters and countless fence
> > actions, we've never had a dual-fence event (where both nodes go down).
> > So it can be done safely, but as always, test test test before prod.
>
> No doubt that this setup is working reliably.
> You just have to know your fencing-devices and
> which delays they involve.
>
> If we are talking about SBD (with disk as otherwise
> it doesn't work in a sensible way in 2-node-clusters)
> for instance I would strongly advise using a delay.
>
> So I guess it is important to understand the basic
> idea behind this different delay-based fence-race
> avoidance.
> Afterwards you can still decide why it is no issue
> in your own setup.
>
> >
> >>> If the delay is set on both nodes, and they are different, it will work
> >>> fine. The reason not to do this is that if you use 0, then don't use
> >>> anything at all (0 is default), and any other value causes avoidable
> >>> fence delays.
> >>>
> >>>>> Don't use a delay on the other node, just the node you want to live
> >>>>> in such a case.
> >>>>>
> >>>>>>> *Given Node1 is active and Node2 goes down, does it mean fence1
> >>>>>>> will execute first and shut down Node1 even though it is Node2
> >>>>>>> that went down?
> >>>>>> If Node2 managed to sign off properly it will not.
> >>>>>> If the network connection is down so that Node2 can't inform Node1
> >>>>>> that it is going down and has finally stopped all resources, it
> >>>>>> will be fenced by Node1.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Klaus
> >>>>> Fencing occurs in two cases;
> >>>>>
> >>>>> 1. The node stops responding (meaning it's in an unknown state, so
> >>>>> it is fenced to force it into a known state).
> >>>>> 2. A resource / service fails to stop. In this case, the service is
> >>>>> in an unknown state, so the node is fenced to force the service into
> >>>>> a known state so that it can be safely recovered on the peer.
> >>>>>
> >>>>> Graceful withdrawal of the node from the cluster, and graceful
> >>>>> stopping of services will not lead to a fence (because in both cases,
> >>>>> the node / service are in a known state - off).
> >>>>>
> >>>
> >
>
>
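
The two fencing triggers listed in the quoted message above can be written
as a small decision function. This is only a sketch of the rule as stated
in the thread; the function and parameter names are hypothetical and are
not part of any Pacemaker API.

```python
def fence_reason(responding, left_gracefully, stop_failed):
    """Return why a node would be fenced, or None if no fence is needed."""
    # Case 2: a resource / service that fails to stop is in an unknown
    # state, so the node is fenced to make recovery on the peer safe.
    if stop_failed:
        return "resource failed to stop"
    # Case 1: a silent node is in an unknown state, unless it withdrew
    # from the cluster gracefully, in which case it is known to be off.
    if not responding and not left_gracefully:
        return "node stopped responding"
    return None


print(fence_reason(responding=False, left_gracefully=True, stop_failed=False))
print(fence_reason(responding=False, left_gracefully=False, stop_failed=False))
```

The first call models a graceful withdrawal and yields no fence. The
second models a node that simply went silent and therefore gets fenced.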