[ClusterLabs] issue during Pacemaker failover testing

David Dolan daithidolan at gmail.com
Thu Aug 31 06:28:42 EDT 2023


On Wed, 30 Aug 2023 at 17:35, David Dolan <daithidolan at gmail.com> wrote:

>
>
>> > Hi All,
>> >
>> > I'm running Pacemaker on Centos7
>> > Name        : pcs
>> > Version     : 0.9.169
>> > Release     : 3.el7.centos.3
>> > Architecture: x86_64
>> >
>> >
>> Besides the pcs version, the versions of the other cluster-stack
>> components (pacemaker, corosync) would also be interesting.
>>
>  rpm -qa | egrep "pacemaker|pcs|corosync|fence-agents"
> fence-agents-vmware-rest-4.2.1-41.el7_9.6.x86_64
> corosynclib-2.4.5-7.el7_9.2.x86_64
> pacemaker-cluster-libs-1.1.23-1.el7_9.1.x86_64
> fence-agents-common-4.2.1-41.el7_9.6.x86_64
> corosync-2.4.5-7.el7_9.2.x86_64
> pacemaker-cli-1.1.23-1.el7_9.1.x86_64
> pacemaker-1.1.23-1.el7_9.1.x86_64
> pcs-0.9.169-3.el7.centos.3.x86_64
> pacemaker-libs-1.1.23-1.el7_9.1.x86_64
>
>>
>>
>> > I'm performing some cluster failover tests in a 3-node cluster. We have
>> > 3 resources in the cluster.
>> > I was trying to see if I could get it working if 2 nodes fail at
>> > different times. I'd like the 3 resources to then run on one node.
>> >
>> > The quorum options I've configured are as follows:
>> > [root at node1 ~]# pcs quorum config
>> > Options:
>> >   auto_tie_breaker: 1
>> >   last_man_standing: 1
>> >   last_man_standing_window: 10000
>> >   wait_for_all: 1
>> >
>> >
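
For reference, the options above live in the quorum section of
/etc/corosync/corosync.conf. A rough sketch of what that section should
look like with this configuration (the actual file on the nodes may
differ slightly):

    quorum {
        provider: corosync_votequorum
        auto_tie_breaker: 1
        last_man_standing: 1
        last_man_standing_window: 10000
        wait_for_all: 1
    }
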
>> Not sure if the combination of auto_tie_breaker and last_man_standing
>> makes sense.
>> And since you have a cluster with an odd number of nodes, auto_tie_breaker
>> should be disabled anyway, I guess.
>>
> Ah ok, I'll try removing auto_tie_breaker and leaving last_man_standing.
>
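
For anyone following along, the change I have in mind is roughly the
following; I believe pcs quorum update accepts these options on this
version, and it may require the cluster to be stopped while it runs:

    pcs cluster stop --all
    pcs quorum update auto_tie_breaker=0
    pcs cluster start --all
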
>>
>>
>> > [root at node1 ~]# pcs quorum status
>> > Quorum information
>> > ------------------
>> > Date:             Wed Aug 30 11:20:04 2023
>> > Quorum provider:  corosync_votequorum
>> > Nodes:            3
>> > Node ID:          1
>> > Ring ID:          1/1538
>> > Quorate:          Yes
>> >
>> > Votequorum information
>> > ----------------------
>> > Expected votes:   3
>> > Highest expected: 3
>> > Total votes:      3
>> > Quorum:           2
>> > Flags:            Quorate WaitForAll LastManStanding AutoTieBreaker
>> >
>> > Membership information
>> > ----------------------
>> >     Nodeid      Votes    Qdevice Name
>> >          1          1         NR node1 (local)
>> >          2          1         NR node2
>> >          3          1         NR node3
>> >
>> > If I stop the cluster services on node 2 and node 3, the groups all
>> > fail over to node 1, since it is the node with the lowest ID.
>> > But if I stop them on node 1 and node 2, or on node 1 and node 3, the
>> > cluster fails.
>> >
>> > I tried adding this line to corosync.conf, and I could then bring down
>> > the services on node 1 and node 2, or on node 2 and node 3, but if I
>> > left node 2 until last, the cluster failed:
>> > auto_tie_breaker_node: 1  3
>> >
>> > This line had the same outcome as using 1 3:
>> > auto_tie_breaker_node: 1  2 3
>> >
>> >
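
In case anyone wants to reproduce this: as far as I can tell,
auto_tie_breaker_node isn't exposed through pcs quorum update, so I added
it to the quorum section of corosync.conf directly. A minimal sketch with
a single node ID, which is the only form I'm fairly sure is valid:

    quorum {
        provider: corosync_votequorum
        auto_tie_breaker: 1
        auto_tie_breaker_node: 1
        last_man_standing: 1
        last_man_standing_window: 10000
        wait_for_all: 1
    }

The file then needs to be identical on all nodes (pcs cluster sync should
push it out), and corosync restarted or its configuration reloaded for the
change to take effect, as far as I understand.
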
>> Giving multiple auto_tie_breaker nodes doesn't make sense to me, and rather
>> sounds dangerous, if that configuration is possible at all.
>>
>> Maybe the misbehavior of last_man_standing is due to this (possibly
>> unrecognized) misconfiguration.
>> Did you wait long enough between letting the 2 nodes fail?
>>
> I've done it so many times that I believe so. But I'll try removing the
> auto_tie_breaker config, leaving last_man_standing. I'll also make sure
> I leave a couple of minutes between bringing down the nodes, and post back.
>
Just confirming: I removed the auto_tie_breaker config and tested. The quorum
configuration is now as follows:
 Options:
  last_man_standing: 1
  last_man_standing_window: 10000
  wait_for_all: 1

I waited 2-3 minutes between stopping the cluster services on two of the
nodes via pcs cluster stop. The remaining cluster node was then fenced. I
was hoping the remaining node would stay online and keep running the
resources.
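
One thing worth checking between the two stops is whether last_man_standing
has actually recalculated the expected votes before the second node goes
down, e.g. with something like:

    corosync-quorumtool -s   # wait for "Expected votes" to drop from 3 to 2
    pcs quorum status        # should show the same recalculated values

If expected votes is still 3 when the second node stops, the last node will
lose quorum, which could be related to what I'm seeing, though I'm not sure.
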


>> Klaus
>>
>>
>> > So I'd like it to fail over when any combination of two nodes fails, but
>> > I've only had success when the middle node isn't last.
>> >
>> > Thanks
>> > David
>>
>>
>>
>>