[ClusterLabs] issue during Pacemaker failover testing

Klaus Wenninger kwenning at redhat.com
Mon Sep 4 07:25:05 EDT 2023


On Mon, Sep 4, 2023 at 1:18 PM Klaus Wenninger <kwenning at redhat.com> wrote:

>
>
> On Mon, Sep 4, 2023 at 12:45 PM David Dolan <daithidolan at gmail.com> wrote:
>
>> Hi Klaus,
>>
>> With default quorum options I've performed the following on my 3-node
>> cluster:
>>
>> Bring down cluster services on one node - the running services migrate to
>> another node
>> Wait 3 minutes
>> Bring down cluster services on one of the two remaining nodes - the
>> surviving node in the cluster is then fenced
>>
>> Instead of the surviving node being fenced, I hoped that the services
>> would migrate and run on that remaining node.
>>
>> Just looking for confirmation that my understanding is OK, or whether I'm
>> missing something?
>>
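For reference, that sequence as plain commands would be roughly the following
(just a sketch - substitute whichever two nodes you are actually taking down;
names as in the quorum status output further down):

  pcs cluster stop node2    # first node down - resources migrate to the others
  sleep 180                 # wait ~3 minutes
  pcs cluster stop node3    # second node down - hope: node1 keeps the resources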
>
> As said I've never used it ...
> Well, once you are down to 2 nodes, LMS by definition gets into trouble, as
> after another outage either of them would be alone. In case of an ordered
> shutdown this could possibly be circumvented, though. So I guess your first
> attempt to enable auto_tie_breaker was the right idea. That way you will
> still have service on at least one of the nodes.
> So I guess what you were seeing is the correct - and unfortunately only
> possible - behavior.
> Where LMS shines is probably in scenarios with substantially more nodes.
>

Or go for qdevice with LMS, where I would expect it to be able to really go
down to a single node left - either of the last 2 - as there is still the
qdevice vote.
Sorry for the confusion, btw.
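
In case it helps, the qdevice setup would look roughly like this (untested
here; "qnetd-host" is just a placeholder for a machine outside the cluster
that runs the qnetd daemon):

  # on the qnetd host
  yum install corosync-qnetd pcs
  pcs qdevice setup model net --enable --start

  # on one of the cluster nodes (corosync-qdevice needs to be installed on all three)
  yum install corosync-qdevice
  pcs quorum device add model net host=qnetd-host algorithm=lms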

Klaus

>
> Klaus
>
>>
>> Thanks
>> David
>>
>>
>>
>> On Thu, 31 Aug 2023 at 11:59, David Dolan <daithidolan at gmail.com> wrote:
>>
>>> I just tried removing all the quorum options, setting everything back to
>>> defaults, so no last_man_standing or wait_for_all.
>>> I still see the same behaviour where the third node is fenced if I bring
>>> down services on two nodes.
>>> Thanks
>>> David
>>>
>>> On Thu, 31 Aug 2023 at 11:44, Klaus Wenninger <kwenning at redhat.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Thu, Aug 31, 2023 at 12:28 PM David Dolan <daithidolan at gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Wed, 30 Aug 2023 at 17:35, David Dolan <daithidolan at gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>>> > Hi All,
>>>>>>> >
>>>>>>> > I'm running Pacemaker on Centos7
>>>>>>> > Name        : pcs
>>>>>>> > Version     : 0.9.169
>>>>>>> > Release     : 3.el7.centos.3
>>>>>>> > Architecture: x86_64
>>>>>>> >
>>>>>>> >
>>>>>>> Besides the pcs version, the versions of the other cluster-stack
>>>>>>> components (pacemaker, corosync) could be interesting.
>>>>>>>
>>>>>>  rpm -qa | egrep "pacemaker|pcs|corosync|fence-agents"
>>>>>> fence-agents-vmware-rest-4.2.1-41.el7_9.6.x86_64
>>>>>> corosynclib-2.4.5-7.el7_9.2.x86_64
>>>>>> pacemaker-cluster-libs-1.1.23-1.el7_9.1.x86_64
>>>>>> fence-agents-common-4.2.1-41.el7_9.6.x86_64
>>>>>> corosync-2.4.5-7.el7_9.2.x86_64
>>>>>> pacemaker-cli-1.1.23-1.el7_9.1.x86_64
>>>>>> pacemaker-1.1.23-1.el7_9.1.x86_64
>>>>>> pcs-0.9.169-3.el7.centos.3.x86_64
>>>>>> pacemaker-libs-1.1.23-1.el7_9.1.x86_64
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> > I'm performing some cluster failover tests in a 3-node cluster. We
>>>>>>> > have 3 resources in the cluster.
>>>>>>> > I was trying to see if I could get it working if 2 nodes fail at
>>>>>>> > different times. I'd like the 3 resources to then run on one node.
>>>>>>> >
>>>>>>> > The quorum options I've configured are as follows
>>>>>>> > [root@node1 ~]# pcs quorum config
>>>>>>> > Options:
>>>>>>> >   auto_tie_breaker: 1
>>>>>>> >   last_man_standing: 1
>>>>>>> >   last_man_standing_window: 10000
>>>>>>> >   wait_for_all: 1
>>>>>>> >
>>>>>>> >
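For reference, those options live in the quorum section of corosync.conf,
roughly like this (transcribed from the pcs output above, not taken from the
actual file):

  quorum {
      provider: corosync_votequorum
      auto_tie_breaker: 1
      last_man_standing: 1
      last_man_standing_window: 10000
      wait_for_all: 1
  }
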
>>>>>>> Not sure if the combination of auto_tie_breaker and last_man_standing
>>>>>>> makes sense.
>>>>>>> And as you have a cluster with an odd number of nodes, auto_tie_breaker
>>>>>>> should be disabled anyway, I guess.
>>>>>>>
>>>>>> Ah OK, I'll try removing auto_tie_breaker and leaving last_man_standing.
>>>>>>
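Something like this should do it, if you go through pcs rather than editing
corosync.conf by hand (just a sketch - pcs may insist on the cluster being
stopped while these options are changed):

  pcs cluster stop --all
  pcs quorum update auto_tie_breaker=0 last_man_standing=1 wait_for_all=1
  pcs cluster start --all
  pcs quorum config          # verify the resulting settings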
>>>>>>>
>>>>>>>
>>>>>>> > [root@node1 ~]# pcs quorum status
>>>>>>> > Quorum information
>>>>>>> > ------------------
>>>>>>> > Date:             Wed Aug 30 11:20:04 2023
>>>>>>> > Quorum provider:  corosync_votequorum
>>>>>>> > Nodes:            3
>>>>>>> > Node ID:          1
>>>>>>> > Ring ID:          1/1538
>>>>>>> > Quorate:          Yes
>>>>>>> >
>>>>>>> > Votequorum information
>>>>>>> > ----------------------
>>>>>>> > Expected votes:   3
>>>>>>> > Highest expected: 3
>>>>>>> > Total votes:      3
>>>>>>> > Quorum:           2
>>>>>>> > Flags:            Quorate WaitForAll LastManStanding AutoTieBreaker
>>>>>>> >
>>>>>>> > Membership information
>>>>>>> > ----------------------
>>>>>>> >     Nodeid      Votes    Qdevice Name
>>>>>>> >          1          1         NR node1 (local)
>>>>>>> >          2          1         NR node2
>>>>>>> >          3          1         NR node3
>>>>>>> >
>>>>>>> > If I stop the cluster services on node2 and node3, the groups all
>>>>>>> > fail over to node1, since it is the node with the lowest ID.
>>>>>>> > But if I stop them on node1 and node2, or on node1 and node3, the
>>>>>>> > cluster fails.
>>>>>>> >
>>>>>>> > I tried adding this line to corosync.conf, and I could then bring
>>>>>>> > down the services on node1 and node2, or on node2 and node3, but if
>>>>>>> > I left node2 until last, the cluster failed:
>>>>>>> > auto_tie_breaker_node: 1  3
>>>>>>> >
>>>>>>> > This line had the same outcome as using "1  3":
>>>>>>> > auto_tie_breaker_node: 1  2 3
>>>>>>> >
>>>>>>> >
>>>>>>> Giving multiple auto_tie_breaker nodes doesn't make sense to me, but
>>>>>>> rather sounds dangerous, if that configuration is possible at all.
>>>>>>>
>>>>>>> Maybe the misbehavior of last_man_standing is due to this (maybe not
>>>>>>> recognized) misconfiguration.
>>>>>>> Did you wait long enough between letting the 2 nodes fail?
>>>>>>>
>>>>>> I've done it so many times, so I believe so. But I'll try removing the
>>>>>> auto_tie_breaker config, leaving last_man_standing. I'll also make sure
>>>>>> I leave a couple of minutes between bringing down the nodes and post back.
>>>>>>
>>>>> Just confirming I removed the auto_tie_breaker config and tested.
>>>>> Quorum configuration is as follows:
>>>>>  Options:
>>>>>   last_man_standing: 1
>>>>>   last_man_standing_window: 10000
>>>>>   wait_for_all: 1
>>>>>
>>>>> I waited 2-3 minutes between stopping cluster services on two nodes
>>>>> via pcs cluster stop.
>>>>> The remaining cluster node is then fenced. I was hoping the remaining
>>>>> node would stay online running the resources.
>>>>>
>>>>
>>>> Yep - that would've been my understanding as well.
>>>> But honestly, I've never used last_man_standing in this context - I wasn't
>>>> even aware that it was offered without qdevice, nor have I checked how it
>>>> is implemented.
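
One thing that might help narrow it down: between the two stops, check on the
node that should survive whether votequorum actually recalculated the expected
votes (a sketch, nothing more):

  corosync-quorumtool -s    # after the first node is stopped plus ~10s
                            # (last_man_standing_window), "Expected votes"
                            # should have dropped from 3 to 2 if LMS kicked in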
>>>>
>>>> Klaus
>>>>
>>>>>
>>>>>
>>>>>>> Klaus
>>>>>>>
>>>>>>>
>>>>>>> > So I'd like it to fail over when any combination of two nodes fails,
>>>>>>> > but I've only had success when the middle node isn't last.
>>>>>>> >
>>>>>>> > Thanks
>>>>>>> > David
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>