[ClusterLabs] issue during Pacemaker failover testing

Klaus Wenninger kwenning at redhat.com
Thu Aug 31 06:44:14 EDT 2023


On Thu, Aug 31, 2023 at 12:28 PM David Dolan <daithidolan at gmail.com> wrote:

>
>
> On Wed, 30 Aug 2023 at 17:35, David Dolan <daithidolan at gmail.com> wrote:
>
>>
>>
>>> > Hi All,
>>> >
>>> > I'm running Pacemaker on CentOS 7
>>> > Name        : pcs
>>> > Version     : 0.9.169
>>> > Release     : 3.el7.centos.3
>>> > Architecture: x86_64
>>> >
>>> >
>>> Besides the pcs version, the versions of the other cluster-stack
>>> components (pacemaker, corosync) would be interesting.
>>>
>>  rpm -qa | egrep "pacemaker|pcs|corosync|fence-agents"
>> fence-agents-vmware-rest-4.2.1-41.el7_9.6.x86_64
>> corosynclib-2.4.5-7.el7_9.2.x86_64
>> pacemaker-cluster-libs-1.1.23-1.el7_9.1.x86_64
>> fence-agents-common-4.2.1-41.el7_9.6.x86_64
>> corosync-2.4.5-7.el7_9.2.x86_64
>> pacemaker-cli-1.1.23-1.el7_9.1.x86_64
>> pacemaker-1.1.23-1.el7_9.1.x86_64
>> pcs-0.9.169-3.el7.centos.3.x86_64
>> pacemaker-libs-1.1.23-1.el7_9.1.x86_64
>>
>>>
>>>
>>> > I'm performing some cluster failover tests in a 3-node cluster. We
>>> > have 3 resources in the cluster.
>>> > I was trying to see whether it keeps working if 2 nodes fail at
>>> > different times. I'd like the 3 resources to then run on one node.
>>> >
>>> > The quorum options I've configured are as follows
>>> > [root at node1 ~]# pcs quorum config
>>> > Options:
>>> >   auto_tie_breaker: 1
>>> >   last_man_standing: 1
>>> >   last_man_standing_window: 10000
>>> >   wait_for_all: 1
>>> >
>>> >
>>> Not sure the combination of auto_tie_breaker and last_man_standing
>>> makes sense.
>>> And as you have a cluster with an odd number of nodes, auto_tie_breaker
>>> should be disabled anyway, I guess.
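>>> A quorum section without auto_tie_breaker would then look roughly like
>>> this (a sketch based on the options you posted, not a tested config):
>>>
>>> quorum {
>>>     provider: corosync_votequorum
>>>     last_man_standing: 1
>>>     last_man_standing_window: 10000
>>>     wait_for_all: 1
>>> }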
>>>
>> Ah, OK, I'll try removing auto_tie_breaker and leaving last_man_standing.
>>
>>>
>>>
>>> > [root at node1 ~]# pcs quorum status
>>> > Quorum information
>>> > ------------------
>>> > Date:             Wed Aug 30 11:20:04 2023
>>> > Quorum provider:  corosync_votequorum
>>> > Nodes:            3
>>> > Node ID:          1
>>> > Ring ID:          1/1538
>>> > Quorate:          Yes
>>> >
>>> > Votequorum information
>>> > ----------------------
>>> > Expected votes:   3
>>> > Highest expected: 3
>>> > Total votes:      3
>>> > Quorum:           2
>>> > Flags:            Quorate WaitForAll LastManStanding AutoTieBreaker
>>> >
>>> > Membership information
>>> > ----------------------
>>> >     Nodeid      Votes    Qdevice Name
>>> >          1          1         NR node1 (local)
>>> >          2          1         NR node2
>>> >          3          1         NR node3
>>> >
>>> > If I stop the cluster services on node2 and node3, the groups all
>>> > fail over to node1, since it is the node with the lowest ID.
>>> > But if I stop them on node1 and node2, or node1 and node3, the
>>> > cluster fails.
>>> >
>>> > I tried adding this line to corosync.conf, and I could then bring
>>> > down the services on node1 and node2, or node2 and node3, but if I
>>> > left node2 until last, the cluster failed:
>>> > auto_tie_breaker_node: 1  3
>>> >
>>> > This line had the same outcome as using "1  3":
>>> > auto_tie_breaker_node: 1  2 3
>>> >
>>> >
>>> Giving multiple auto_tie_breaker nodes doesn't make sense to me; it
>>> rather sounds dangerous, if that configuration is possible at all.
>>>
>>> Maybe the misbehavior of last_man_standing is due to this (possibly
>>> unrecognized) misconfiguration.
>>> Did you wait long enough between letting the 2 nodes fail?
>>> (last_man_standing_window is in milliseconds, so 10000 is only 10
>>> seconds.)
>>>
>> I've done it so many times that I believe so. But I'll try removing the
>> auto_tie_breaker config, leaving last_man_standing. I'll also make sure
>> I leave a couple of minutes between bringing down the nodes, and post
>> back.
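>> As a sketch, this is roughly the sequence I'll run (node names as in my
>> setup, timings just illustrative):
>>
>> pcs cluster stop node3     # first node down
>> sleep 120                  # well beyond last_man_standing_window (10s)
>> pcs cluster stop node2     # second node down
>> pcs quorum status          # on node1: check it is still quorate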
>>
> Just confirming I removed the auto_tie_breaker config and tested. Quorum
> configuration is as follows:
>  Options:
>   last_man_standing: 1
>   last_man_standing_window: 10000
>   wait_for_all: 1
>
> I waited 2-3 minutes between stopping cluster services on two nodes via
> pcs cluster stop.
> The remaining cluster node was then fenced. I was hoping the remaining
> node would stay online, running the resources.
>

Yep - that would have been my understanding as well.
But honestly, I've never used last_man_standing in this context - I wasn't
even aware that it was offered without qdevice, nor have I checked how it
is implemented.
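
If you want to verify that last_man_standing actually recalculates the
votes, watching the votequorum state between the two stops might help
(just a suggestion, untested on my side):

  corosync-quorumtool -s   # "Expected votes" should drop to 2 once the
                           # first node has left and the window expired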

Klaus

>
>
>>> Klaus
>>>
>>>
>>> > So I'd like it to fail over when any combination of two nodes fails,
>>> > but I've only had success when the middle node isn't last.
>>> >
>>> > Thanks
>>> > David