[ClusterLabs] issue during Pacemaker failover testing

Andrei Borzenkov arvidjaar at gmail.com
Mon Sep 4 06:59:35 EDT 2023


On Mon, Sep 4, 2023 at 1:45 PM David Dolan <daithidolan at gmail.com> wrote:
>
> Hi Klaus,
>
> With default quorum options I've performed the following on my 3-node cluster:
>
> Bring down cluster services on one node - the running services migrate to another node
> Wait 3 minutes
> Bring down cluster services on one of the two remaining nodes - the surviving node in the cluster is then fenced
>

Is it fenced or is it reset? It is not the same.

The default for no-quorum-policy is "stop". So either you have
"no-quorum-policy" set to "suicide", or the node is being reset by something
outside of Pacemaker. That "something" may be initiating the fencing too.
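
For example, both can be checked directly with pcs (0.9.x syntax, from memory,
so treat as approximate):

  pcs property show no-quorum-policy   # nothing shown means the default, "stop"
  pcs stonith show                     # which fence devices are configured

and, if your pacemaker version supports it, the fence history on a surviving
node should show who requested the fencing:

  stonith_admin --history '*'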

> Instead of the surviving node being fenced, I hoped that the services would migrate and run on that remaining node.
>
> Just looking for confirmation that my understanding is OK, and whether I'm missing something.
>
> Thanks
> David
>
>
>
> On Thu, 31 Aug 2023 at 11:59, David Dolan <daithidolan at gmail.com> wrote:
>>
>> I just tried removing all the quorum options, setting everything back to defaults, so no last_man_standing or wait_for_all.
>> I still see the same behaviour where the third node is fenced if I bring down services on two nodes.
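>> For reference, resetting the options was roughly this (typed from memory, so
>> the exact syntax may be slightly off):
>>
>>   pcs cluster stop --all
>>   pcs quorum update auto_tie_breaker=0 last_man_standing=0 wait_for_all=0
>>   pcs cluster start --all
>>   pcs quorum config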
>> Thanks
>> David
>>
>> On Thu, 31 Aug 2023 at 11:44, Klaus Wenninger <kwenning at redhat.com> wrote:
>>>
>>>
>>>
>>> On Thu, Aug 31, 2023 at 12:28 PM David Dolan <daithidolan at gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> On Wed, 30 Aug 2023 at 17:35, David Dolan <daithidolan at gmail.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> > Hi All,
>>>>>> >
>>>>>> > I'm running Pacemaker on Centos7
>>>>>> > Name        : pcs
>>>>>> > Version     : 0.9.169
>>>>>> > Release     : 3.el7.centos.3
>>>>>> > Architecture: x86_64
>>>>>> >
>>>>>> >
>>>>>> Besides the pcs version, the versions of the other cluster-stack
>>>>>> components (pacemaker, corosync) would be interesting.
>>>>>
>>>>>  rpm -qa | egrep "pacemaker|pcs|corosync|fence-agents"
>>>>> fence-agents-vmware-rest-4.2.1-41.el7_9.6.x86_64
>>>>> corosynclib-2.4.5-7.el7_9.2.x86_64
>>>>> pacemaker-cluster-libs-1.1.23-1.el7_9.1.x86_64
>>>>> fence-agents-common-4.2.1-41.el7_9.6.x86_64
>>>>> corosync-2.4.5-7.el7_9.2.x86_64
>>>>> pacemaker-cli-1.1.23-1.el7_9.1.x86_64
>>>>> pacemaker-1.1.23-1.el7_9.1.x86_64
>>>>> pcs-0.9.169-3.el7.centos.3.x86_64
>>>>> pacemaker-libs-1.1.23-1.el7_9.1.x86_64
>>>>>>
>>>>>>
>>>>>>
>>>>>> > I'm performing some cluster failover tests in a 3 node cluster. We have 3
>>>>>> > resources in the cluster.
>>>>>> > I was trying to see if I could get it working if 2 nodes fail at different
>>>>>> > times. I'd like the 3 resources to then run on one node.
>>>>>> >
>>>>>> > The quorum options I've configured are as follows
>>>>>> > [root at node1 ~]# pcs quorum config
>>>>>> > Options:
>>>>>> >   auto_tie_breaker: 1
>>>>>> >   last_man_standing: 1
>>>>>> >   last_man_standing_window: 10000
>>>>>> >   wait_for_all: 1
>>>>>> >
>>>>>> >
>>>>>> Not sure the combination of auto_tie_breaker and last_man_standing makes
>>>>>> sense. And as you have a cluster with an odd number of nodes,
>>>>>> auto_tie_breaker should be disabled anyway, I guess.
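>>>>>>
>>>>>> If you only want last_man_standing, the quorum section in corosync.conf
>>>>>> would look something like this (untested sketch):
>>>>>>
>>>>>> quorum {
>>>>>>     provider: corosync_votequorum
>>>>>>     last_man_standing: 1
>>>>>>     last_man_standing_window: 10000
>>>>>>     wait_for_all: 1
>>>>>> }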
>>>>>
>>>>> Ah, OK, I'll try removing auto_tie_breaker and leaving last_man_standing.
>>>>>>
>>>>>>
>>>>>>
>>>>>> > [root at node1 ~]# pcs quorum status
>>>>>> > Quorum information
>>>>>> > ------------------
>>>>>> > Date:             Wed Aug 30 11:20:04 2023
>>>>>> > Quorum provider:  corosync_votequorum
>>>>>> > Nodes:            3
>>>>>> > Node ID:          1
>>>>>> > Ring ID:          1/1538
>>>>>> > Quorate:          Yes
>>>>>> >
>>>>>> > Votequorum information
>>>>>> > ----------------------
>>>>>> > Expected votes:   3
>>>>>> > Highest expected: 3
>>>>>> > Total votes:      3
>>>>>> > Quorum:           2
>>>>>> > Flags:            Quorate WaitForAll LastManStanding AutoTieBreaker
>>>>>> >
>>>>>> > Membership information
>>>>>> > ----------------------
>>>>>> >     Nodeid      Votes    Qdevice Name
>>>>>> >          1          1         NR node1 (local)
>>>>>> >          2          1         NR node2
>>>>>> >          3          1         NR node3
>>>>>> >
>>>>>> > If I stop the cluster services on node 2 and node 3, the groups all fail
>>>>>> > over to node 1 since it is the node with the lowest ID.
>>>>>> > But if I stop them on node 1 and node 2, or node 1 and node 3, the cluster
>>>>>> > fails.
>>>>>> >
>>>>>> > I tried adding this line to corosync.conf, and I could then bring down the
>>>>>> > services on node 1 and node 2, or node 2 and node 3, but if I left node 2
>>>>>> > until last, the cluster failed:
>>>>>> > auto_tie_breaker_node: 1  3
>>>>>> >
>>>>>> > This line had the same outcome as using 1 3
>>>>>> > auto_tie_breaker_node: 1  2 3
>>>>>> >
>>>>>> >
>>>>>> Giving multiple auto_tie_breaker nodes doesn't make sense to me, and rather
>>>>>> sounds dangerous, if that configuration is accepted at all.
>>>>>>
>>>>>> Maybe the misbehavior of last_man_standing is due to this (possibly
>>>>>> unrecognized) misconfiguration.
>>>>>> Did you wait long enough between letting the two nodes fail?
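>>>>>>
>>>>>> One way to check is to look at the vote counts between the two failures,
>>>>>> e.g. on one of the surviving nodes:
>>>>>>
>>>>>>   corosync-quorumtool -s
>>>>>>
>>>>>> Expected votes should have dropped from 3 to 2 once the
>>>>>> last_man_standing_window has expired; only then would I take the second
>>>>>> node down.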
>>>>>
>>>>> I've done it so many times that I believe so. But I'll try removing the auto_tie_breaker config, leaving last_man_standing in place. I'll also make sure I leave a couple of minutes between bringing down the nodes, and post back.
>>>>
>>>> Just confirming I removed the auto_tie_breaker config and tested. Quorum configuration is as follows:
>>>>  Options:
>>>>   last_man_standing: 1
>>>>   last_man_standing_window: 10000
>>>>   wait_for_all: 1
>>>>
>>>> I waited 2-3 minutes between stopping cluster services on the two nodes via pcs cluster stop.
>>>> The remaining cluster node was then fenced. I was hoping it would stay online, running the resources.
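>>>>
>>>> For completeness, the sequence was along these lines (node names here are
>>>> just placeholders):
>>>>
>>>>   pcs cluster stop node2
>>>>   sleep 180
>>>>   corosync-quorumtool -s   # point to check the quorum state
>>>>   pcs cluster stop node3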
>>>
>>>
>>> Yep - that would have been my understanding as well.
>>> But honestly, I've never used last_man_standing in this context - I wasn't
>>> even aware it was offered without a qdevice, nor have I checked how it is
>>> implemented.
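>>>
>>> If a qdevice is an option for you, its "lms" algorithm might be worth a look;
>>> setup would be something along these lines (untested, the qnetd host name is
>>> a placeholder):
>>>
>>>   pcs quorum device add model net host=qnetd-host algorithm=lms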
>>>
>>> Klaus
>>>>
>>>>
>>>>>>
>>>>>> Klaus
>>>>>>
>>>>>>
>>>>>> > So I'd like it to fail over when any combination of two nodes fails, but I've
>>>>>> > only had success when the middle node isn't the last one standing.
>>>>>> >
>>>>>> > Thanks
>>>>>> > David
>>>>>>
>>>>>>
>>>>>>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/

