[ClusterLabs] Failure of preferred node in a 2 node cluster

Klaus Wenninger kwenning at redhat.com
Mon Apr 30 07:25:48 UTC 2018


On 04/30/2018 08:51 AM, Christine Caulfield wrote:
> On 29/04/18 13:22, Andrei Borzenkov wrote:
>> 29.04.2018 04:19, Wei Shan wrote:
>>> Hi,
>>>
>>> I'm using Red Hat Cluster Suite 7 with a watchdog-timer-based fence agent.
>>> I understand this is a really bad setup, but this is what the end-user wants.
>>>
>>> ATB => auto_tie_breaker
>>>
>>> "When auto_tie_breaker is used in even-numbered-member clusters, the
>>> failure of the partition containing the auto_tie_breaker_node (by default
>>> the node with the lowest ID) will cause the other partition to become
>>> inquorate, and it will self-fence. In 2-node clusters with auto_tie_breaker
>>> this means that failure of the node favoured by auto_tie_breaker_node
>>> (typically nodeid 1) will result in a reboot of the other node (typically
>>> nodeid 2) that detects the inquorate state. If this is undesirable, then
>>> corosync-qdevice can be used instead of auto_tie_breaker to provide an
>>> additional vote to quorum, making the behaviour closer to that of
>>> odd-numbered-member clusters."
>>>
>> That's not what the upstream corosync manual page says. Corosync itself
>> won't initiate self-fencing; it just marks the node as being out of quorum.
>> What happens later depends on higher layers like pacemaker. Pacemaker
>> can be configured to commit suicide, but it can also be configured to
>> ignore quorum completely. I am not familiar with the details of how RHCS
>> behaves by default.
>>
>> I just tested on vanilla corosync+pacemaker (openSUSE Tumbleweed) and
>> nothing happens when I kill the lowest node in a two-node configuration.
>>
> That is the expected behaviour for a 2-node ATB cluster. If the
> preferred node is not available, then the remaining node will stall until
> it comes back again. It sounds odd, but that's what happens. A preferred
> node is a preferred node. If it can move from one to the other when it
> fails, then it's not a preferred node ... it's just a node :)
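
(For reference, the kind of quorum configuration being discussed above
would look roughly like the sketch below - illustrative only, values are
placeholders. As Andrei points out, corosync only withdraws quorum from
the losing partition; the actual self-fencing comes from the layers on
top, i.e. pacemaker/sbd.)

    quorum {
        provider: corosync_votequorum
        # normally derived from the nodelist, shown here for clarity
        expected_votes: 2
        auto_tie_breaker: 1
        # "lowest" is the default: the partition holding the lowest
        # nodeid keeps quorum when the cluster splits evenly
        auto_tie_breaker_node: lowest
    }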

If I'm understanding the setup right, we are talking about
2-node + sbd with just watchdog-fencing here.
I guess we've discussed that setup a couple of times
already and have come to the conclusion that it
doesn't make very much sense, as the availability
is always going to be (slightly) below the availability of the
preferred node alone, without a cluster.
(Given that ATB comes with suicide, as this is the only
way it would be a safe/working cluster at all.)
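
For the watchdog-only sbd part, a rough sketch (RHEL-ish paths; the
timeouts are just example values) would be:

    # /etc/sysconfig/sbd - diskless/watchdog-only mode, no SBD_DEVICE set
    SBD_WATCHDOG_DEV=/dev/watchdog
    SBD_WATCHDOG_TIMEOUT=5

    # enable the sbd service and tell pacemaker that suicide via the
    # watchdog is the fencing mechanism (timeout roughly 2x the above)
    systemctl enable sbd
    pcs property set stonith-watchdog-timeout=10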

The only reason why one might still do that is some kind of
load-balancing in case both nodes are available, or
some interim setup before adding an additional
node, additional watchdog devices, a shared disk
for sbd or - as Chrissie already suggested - qdevice.
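
Adding a qdevice on a third box is then something along the lines of
the following (hostname is a placeholder; check pcs(8) for the exact
syntax of your version):

    # on the third machine that will run corosync-qnetd
    pcs qdevice setup model net --enable --start

    # on one of the cluster nodes
    pcs quorum device add model net host=qnetd-host algorithm=ffsplit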

Regards,
Klaus


>
> If you need fully resilient failover for 2 nodes, then qdevice is more
> likely what you need.
>
> Chrissie
>
>
>> If your cluster nodes are configured to commit suicide, what happens
>> after reboot depends on at least the wait_for_all corosync setting. With
>> wait_for_all=1 (the default with two_node) and without a) ignoring quorum
>> state and b) having a fencing resource, pacemaker on your node will wait
>> indefinitely after reboot because the partner is not available.
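
(The knobs referred to above, as an illustrative sketch rather than a
recommendation:)

    # corosync.conf - two_node implicitly enables wait_for_all,
    # see votequorum(5); it can be overridden explicitly
    quorum {
        provider: corosync_votequorum
        two_node: 1
        # wait_for_all: 0
    }

    # pacemaker side - what to do when quorum is lost
    pcs property set no-quorum-policy=suicide   # or: ignore / stop / freeze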


