[ClusterLabs] Re: [EXT] Re: Single-node automated startup question

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Thu Apr 15 02:55:52 EDT 2021


>>> Ken Gaillot <kgaillot at redhat.com> wrote on 14.04.2021 at 18:35 in
message
<00635dba0dfc70430d4fd7820677b47d242d65d2.camel at redhat.com>:

[...]
>> 
>> Startup fencing is the Pacemaker default (the startup-fencing cluster
>> option).
> 
> Start-up fencing will have the desired effect in a >2-node cluster, but
> in a 2-node cluster the corosync wait_for_all option is key.
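
For reference, startup fencing is controlled by the Pacemaker cluster
property of the same name, which defaults to true; assuming a pcs-managed
cluster (crmsh would work as well), it can be set explicitly like this:

    # Pacemaker cluster-wide property (true is already the default):
    pcs property set startup-fencing=true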

This is another good example of where Pacemaker is (maybe for historic
reasons) more complicated than necessary (IMHO):
Why not have a single "cluster-formation-timeout" that waits for nodes to
join when initially forming a cluster (i.e. the starting node has no quorum
yet)? If that timeout expires and there is still no quorum (subject to other
configuration parameters), the node commits suicide (self-fencing,
preferably "off" instead of "reboot").
Of course any two-node cluster would need some tie-breaker (like grabbing an
exclusive lock on shared storage).
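
For reference, both two_node and wait_for_all live in the quorum section of
corosync.conf; a minimal sketch of a two-node setup (everything else
omitted):

    quorum {
        provider: corosync_votequorum
        two_node: 1
        # implied by two_node: 1, shown here only for clarity:
        wait_for_all: 1
    }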

> 
> If wait_for_all is true (which is the default when two_node is set),
> then a node that comes up alone will wait until it sees the other node
> at least once before becoming quorate. This prevents an isolated node
> from coming up and fencing a node that's happily running.
> 
> Setting wait_for_all to false will make an isolated node immediately
> become quorate. It will do what you want, which is fence the other node
> and take over resources, but the danger is that this node is the one
> that's having trouble (e.g. can't see the other node due to a network
> card issue). The healthy node could fence the unhealthy node, which
> might then reboot and come up and shoot the healthy node.
> 
> There's no direct equivalent of a delay before becoming quorate, but I
> don't think that helps -- the boot time acts as a sort of random delay,
> and a delay doesn't help the issue of an unhealthy node shooting a
> healthy one.
> 
> My recommendation would be to set wait_for_all to true as long as both
> nodes are known to be healthy. Once an unhealthy node is down and
> expected to stay down, set wait_for_all to false on the healthy node so
> it can reboot and bring the cluster up. (The unhealthy node will still
> have wait_for_all=true, so it won't cause any trouble even if it comes
> up.) 
> 
[...]
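
To follow that recommendation on a pcs-managed cluster (an assumption; the
same change can also be made by editing the quorum section of corosync.conf
directly), the flag can be flipped and later verified like this:

    # change the quorum option (pcs may require the cluster to be
    # stopped on all nodes for this):
    pcs quorum update wait_for_all=0
    # after the healthy node has booted, check quorum state and flags:
    corosync-quorumtool -s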

Regards,
Ulrich



