[ClusterLabs] Antw: [EXT] Stonith failing

Klaus Wenninger kwenning at redhat.com
Mon Aug 17 03:06:12 EDT 2020


On 8/16/20 11:40 AM, Andrei Borzenkov wrote:
> 16.08.2020 04:25, Reid Wahl wrote:
>>
>>> - considering that each of my nodes has stonith configured against the
>>> other node, once the two nodes can communicate again, how can I be sure
>>> the two nodes will not try to stonith each other?
>>>
>> The simplest option is to add a delay attribute (e.g., delay=10) to one of
>> the stonith devices. That way, if both nodes want to fence each other, the
>> node whose stonith device has a delay configured will wait for the delay to
>> expire before executing the reboot action.
If your fence-agent supports a delay attribute you can of course use
that. As this isn't available with every fence-agent, or looks different
depending on the fence-agent, we've introduced
pcmk_delay_max & pcmk_delay_base. These are applied before the
fence-agent is actually called and are thus always available and
always look the same. The delay will be some random time
between pcmk_delay_base and pcmk_delay_max.
This brings us to another approach to reducing the chances of a
fatal fence-race: assuming the condition that triggers the fence-race
is detected on both nodes at around the same time, just adding a
random delay will very likely prevent the nodes from killing each other.
This is especially interesting when there is no clear / easy way
to determine which of the nodes is more important at that time.
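For a 2-node cluster that could look roughly like this (just a sketch;
device names, addresses and credentials below are made up):

    # per-agent delay, where the agent supports it (e.g. fence_ipmilan):
    pcs stonith create fence-node1 fence_ipmilan ip=10.0.0.1 \
        username=admin password=secret pcmk_host_list=node1 delay=10

    # agent-independent alternative, handled by pacemaker itself:
    pcs stonith update fence-node2 pcmk_delay_base=1 pcmk_delay_max=10

With the latter, fencing through fence-node2 is delayed by a random
time, so in a race only one side actually gets to shoot first.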
>>
> Current pacemaker (2.0.4) also supports the priority-fencing-delay
> option, which computes the delay based on which resources are active on
> a specific node, thus favoring the node with "more important" resources.
>
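For completeness, a minimal sketch of that (the resource name is just an
example):

    # mark the important resource ...
    pcs resource update my-db meta priority=10
    # ... and let the node running higher-priority resources win the race:
    pcs property set priority-fencing-delay=15s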
>> Alternatively, you can set up corosync-qdevice, using a separate system
>> running a qnetd server as a quorum arbitrator.
>>
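Roughly like this (the hostname "arbitrator" is made up; qnetd has to run
on a third system outside the cluster):

    # on the arbitrator host, once:
    pcs qdevice setup model net --enable --start

    # on one of the cluster nodes:
    pcs quorum device add model net host=arbitrator algorithm=ffsplit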
> Any solution that is based on node suicide is prone to complete cluster
> loss. In particular, in a two-node cluster with qdevice, the surviving
> node will commit suicide if qnetd is not accessible.
I don't think what Reid suggested was having nodes that lose
quorum commit suicide right away.
You can use quorum simply as a means of preventing the fence-races
otherwise inherent to 2-node clusters.
>
> As long as external stonith is reasonably reliable, it is much preferred
> to any solution based on quorum (unless you have very specific
> requirements and can tolerate running the remaining nodes in "frozen"
> mode to limit unavailability).
Well, we can name the predominant scenario in which one might not want to
depend on fencing devices like IPMI: if you want to cover the case where
the nodes don't just lose corosync connectivity, but where access from one
node to the fencing device of the other is interrupted as well, you
probably won't get around an approach that involves some kind of
arbitrator.
>
> And before someone jumps in - SBD falls into "solution based on suicide"
> as well.
Got your point without that hint ;-)

Klaus