[ClusterLabs] issue during Pacemaker failover testing

Thu Aug 31 06:59:54 EDT 2023

I just tried removing all the quorum options, setting everything back to
defaults, so no last_man_standing or wait_for_all.
I still see the same behaviour where the third node is fenced if I bring
down services on two nodes.
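
A rough way to reason about the vote arithmetic here is sketched below. It is
a simplified model based on my reading of votequorum(5), not corosync's actual
implementation. The function names are made up for illustration, and two
things are assumed: failures are spaced further apart than
last_man_standing_window, and the surviving side holds the tie-breaker node.

```python
def quorum(expected_votes):
    """Simple majority: more than half of the expected votes."""
    return expected_votes // 2 + 1


def stays_quorate(live_steps, expected_votes,
                  last_man_standing=False, auto_tie_breaker=False):
    """Return True if the surviving partition is quorate after every step.

    live_steps lists the live vote count after each failure, e.g. [2, 1]
    for a 3 node cluster that loses one node and then another.
    """
    expected = expected_votes
    for live in live_steps:
        even_split = (live * 2 == expected)
        if live >= quorum(expected):
            quorate = True
        elif even_split and auto_tie_breaker:
            # Assumption: this partition contains the tie-breaker node.
            quorate = True
        else:
            quorate = False
        if not quorate:
            return False
        if last_man_standing:
            # Assumption: the window has elapsed, so expected votes are
            # recalculated down to the votes that are still present.
            expected = live
    return True


print(quorum(3))
print(stays_quorate([2, 1], 3))
print(stays_quorate([2, 1], 3, last_man_standing=True))
print(stays_quorate([2, 1], 3, last_man_standing=True, auto_tie_breaker=True))
```

Under this model, one node out of three is never quorate with the default
options, and last_man_standing on its own still stops at two nodes. That would
match the note in votequorum(5) that auto_tie_breaker is needed for the step
from two nodes down to one.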
Thanks
David

On Thu, 31 Aug 2023 at 11:44, Klaus Wenninger <kwenning at redhat.com> wrote:

>
>
> On Thu, Aug 31, 2023 at 12:28 PM David Dolan <daithidolan at gmail.com>
> wrote:
>
>>
>>
>> On Wed, 30 Aug 2023 at 17:35, David Dolan <daithidolan at gmail.com> wrote:
>>
>>>
>>>
>>>> > Hi All,
>>>> >
>>>> > I'm running Pacemaker on Centos7
>>>> > Name        : pcs
>>>> > Version     : 0.9.169
>>>> > Release     : 3.el7.centos.3
>>>> > Architecture: x86_64
>>>> >
>>>> >
>>>> Besides the pcs version, the versions of the other cluster stack
>>>> components (pacemaker, corosync) could be interesting.
>>>>
>>>  rpm -qa | egrep "pacemaker|pcs|corosync|fence-agents"
>>> fence-agents-vmware-rest-4.2.1-41.el7_9.6.x86_64
>>> corosynclib-2.4.5-7.el7_9.2.x86_64
>>> pacemaker-cluster-libs-1.1.23-1.el7_9.1.x86_64
>>> fence-agents-common-4.2.1-41.el7_9.6.x86_64
>>> corosync-2.4.5-7.el7_9.2.x86_64
>>> pacemaker-cli-1.1.23-1.el7_9.1.x86_64
>>> pacemaker-1.1.23-1.el7_9.1.x86_64
>>> pcs-0.9.169-3.el7.centos.3.x86_64
>>> pacemaker-libs-1.1.23-1.el7_9.1.x86_64
>>>
>>>>
>>>>
>>>> > I'm performing some cluster failover tests in a 3 node cluster. We have 3
>>>> > resources in the cluster.
>>>> > I was trying to see if I could get it working if 2 nodes fail at different
>>>> > times. I'd like the 3 resources to then run on one node.
>>>> >
>>>> > The quorum options I've configured are as follows
>>>> > [root@node1 ~]# pcs quorum config
>>>> > Options:
>>>> >   auto_tie_breaker: 1
>>>> >   last_man_standing: 1
>>>> >   last_man_standing_window: 10000
>>>> >   wait_for_all: 1
>>>> >
>>>> >
>>>> Not sure if the combination of auto_tie_breaker and last_man_standing
>>>> makes sense.
>>>> And as you have a cluster with an odd number of nodes, auto_tie_breaker
>>>> should be disabled anyway, I guess.
>>>>
>>> Ah OK, I'll try removing auto_tie_breaker and leaving last_man_standing.
>>>
>>>>
>>>>
>>>> > [root@node1 ~]# pcs quorum status
>>>> > Quorum information
>>>> > ------------------
>>>> > Date:             Wed Aug 30 11:20:04 2023
>>>> > Quorum provider:  corosync_votequorum
>>>> > Nodes:            3
>>>> > Node ID:          1
>>>> > Ring ID:          1/1538
>>>> > Quorate:          Yes
>>>> >
>>>> > Votequorum information
>>>> > ----------------------
>>>> > Expected votes:   3
>>>> > Highest expected: 3
>>>> > Total votes:      3
>>>> > Quorum:           2
>>>> > Flags:            Quorate WaitForAll LastManStanding AutoTieBreaker
>>>> >
>>>> > Membership information
>>>> > ----------------------
>>>> >     Nodeid      Votes    Qdevice Name
>>>> >          1          1         NR node1 (local)
>>>> >          2          1         NR node2
>>>> >          3          1         NR node3
>>>> >
>>>> > If I stop the cluster services on node 2 and node 3, the groups all fail
>>>> > over to node 1, since it is the node with the lowest ID.
>>>> > But if I stop them on node 1 and node 2, or node 1 and node 3, the cluster
>>>> > fails.
>>>> >
>>>> > I tried adding this line to corosync.conf and I could then bring down the
>>>> > services on node 1 and 2 or node 2 and 3, but if I left node 2 until last,
>>>> > the cluster failed:
>>>> > auto_tie_breaker_node: 1  3
>>>> >
>>>> > This line had the same outcome as using 1 3
>>>> > auto_tie_breaker_node: 1  2 3
>>>> >
>>>> >
>>>> Giving multiple auto_tie_breaker nodes doesn't make sense to me; it rather
>>>> sounds dangerous, if that configuration is possible at all.
>>>>
>>>> Maybe the misbehavior of last_man_standing is due to this (maybe not
>>>> recognized) misconfiguration.
>>>> Did you wait long enough between letting the 2 nodes fail?
>>>>
>>> I've done it so many times that I believe so. But I'll try removing the
>>> auto_tie_breaker config, leaving last_man_standing. I'll also make sure
>>> I leave a couple of minutes between bringing down the nodes and post back.
>>>
>> Just confirming I removed the auto_tie_breaker config and tested. Quorum
>> configuration is as follows:
>>  Options:
>>   last_man_standing: 1
>>   last_man_standing_window: 10000
>>   wait_for_all: 1
>>
>> I waited 2-3 minutes between stopping cluster services on two nodes via
>> pcs cluster stop.
>> The remaining cluster node is then fenced. I was hoping the remaining
>> node would stay online running the resources.
>>
>
> Yep - that would've been my understanding as well.
> But honestly I've never used last_man_standing in this context. I wasn't even
> aware that it was offered without qdevice, nor have I checked how it is
> implemented.
>
> Klaus
>
>>
>>
>>>> Klaus
>>>>
>>>>
>>>> > So I'd like it to fail over when any combination of two nodes fails, but
>>>> > I've only had success when the middle node isn't last.
>>>> >
>>>> > Thanks
>>>> > David
>>>>
>>>>
>>>>
>>>>