[ClusterLabs] no-quorum-policy=ignore is deprecated and replaced with other options, but they are not an effective solution

Priyanka Balotra priyanka.14balotra at gmail.com
Tue Jun 27 13:08:08 EDT 2023


In this case stonith has been configured as a resource,
    primitive stonith-sbd stonith:external/sbd

For it to function properly, the resource needs to be up, which is only
possible if the system is quorate.
Hence our requirement is to make the system quorate even if only one node
of the cluster is up.
Stonith will then take care of any split-brain scenarios.
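
For reference, the quorum settings we have been experimenting with in
corosync.conf look roughly like this (auto_tie_breaker is the option crm
refused; as far as I understand it is a corosync/votequorum setting, so it
belongs here rather than in the Pacemaker properties):

    quorum {
        provider: corosync_votequorum
        wait_for_all: 0
        last_man_standing: 1
        last_man_standing_window: 20000
        # auto_tie_breaker: 1  - reportedly needed for last_man_standing
        #                        to let the cluster shrink to a single node
    }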

Thanks
Priyanka

On Tue, Jun 27, 2023 at 9:06 PM Klaus Wenninger <kwenning at redhat.com> wrote:

>
>
> On Tue, Jun 27, 2023 at 5:24 PM Andrei Borzenkov <arvidjaar at gmail.com>
> wrote:
>
>> On 27.06.2023 07:21, Priyanka Balotra wrote:
>> > Hi Andrei,
>> > After this state the system went through some more fencings and we saw
>> > the following state:
>> >
>> > :~ # crm status
>> > Cluster Summary:
>> >    * Stack: corosync
>> >    * Current DC: FILE-2 (version
>> > 2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36) -
>> > partition with quorum
>>
>> It says "partition with quorum" so what exactly is the problem?
>>
>
> I guess the problem is that resources aren't being recovered on
> the nodes in the quorate partition.
> The reason for that is probably that, as Ken was already suggesting,
> fencing isn't working properly, or the fencing devices used are simply
> inappropriate for the purpose (e.g. onboard IPMI).
> The fact that a node is rebooting isn't enough: the node that initiated
> fencing has to know that it did actually work. But we're just guessing
> here. Logs should show what is actually going on.
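>
> If it helps, the usual places I'd look first (the sbd device path below
> is just a placeholder):
>
>     # did pacemaker record the fencing operation as successful?
>     stonith_admin --history '*' --verbose
>
>     # pacemaker's detail log (default location on most setups)
>     grep -iE 'stonith|fence' /var/log/pacemaker/pacemaker.log
>
>     # for sbd: can the node see and write to the shared device?
>     sbd -d /dev/disk/by-id/<your-sbd-disk> list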
>
> Klaus
>
>>
>> >    * Last updated: Mon Jun 26 12:44:15 2023
>> >    * Last change:  Mon Jun 26 12:41:12 2023 by root via cibadmin on
>> > FILE-2
>> >    * 4 nodes configured
>> >    * 11 resource instances configured
>> >
>> > Node List:
>> >    * Node FILE-1: UNCLEAN (offline)
>> >    * Node FILE-4: UNCLEAN (offline)
>> >    * Online: [ FILE-2 ]
>> >    * Online: [ FILE-3 ]
>> >
>> > At this stage FILE-1 and FILE-4 were continuously getting fenced (we
>> > have device-based stonith configured, but the resource was not up).
>> > Two nodes were online and two were offline, so quorum wasn't attained
>> > again.
>> > 1)  For such a scenario we need help to be able to keep the cluster live.
>> > 2)  In cases where only one node of the cluster is up and the others
>> > are down, we need the resources and the cluster to be up.
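>> >
>> > (For completeness, the votequorum view on the surviving nodes can be
>> > checked with:
>> >
>> >     corosync-quorumtool -s
>> >
>> > the "Quorate:" and "Flags:" lines there show directly whether options
>> > like WaitForAll / LastManStanding are actually in effect.)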
>> >
>> > Thanks
>> > Priyanka
>> >
>> > On Tue, Jun 27, 2023 at 12:25 AM Andrei Borzenkov <arvidjaar at gmail.com>
>> > wrote:
>> >
>> >> On 26.06.2023 21:14, Priyanka Balotra wrote:
>> >>> Hi All,
>> >>> We are seeing an issue where we replaced no-quorum-policy=ignore with
>> >>> other options in corosync.conf in order to simulate the same behaviour:
>> >>>
>> >>>
>> >>>      wait_for_all: 0
>> >>>      last_man_standing: 1
>> >>>      last_man_standing_window: 20000
>> >>>
>> >>> We also tried another property (auto-tie-breaker), but we couldn't
>> >>> configure it because crm did not recognise it.
>> >>>
>> >>> But even after using these options, we are seeing that the system is
>> >>> not quorate unless at least half of the nodes are up.
>> >>>
>> >>> Some properties from crm config are as follows:
>> >>>
>> >>>
>> >>>
>> >>> primitive stonith-sbd stonith:external/sbd \
>> >>>         params pcmk_delay_base=5s
>> >>>
>> >>> ...
>> >>>
>> >>> property cib-bootstrap-options: \
>> >>>         have-watchdog=true \
>> >>>         dc-version="2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36" \
>> >>>         cluster-infrastructure=corosync \
>> >>>         cluster-name=FILE \
>> >>>         stonith-enabled=true \
>> >>>         stonith-timeout=172 \
>> >>>         stonith-action=reboot \
>> >>>         stop-all-resources=false \
>> >>>         no-quorum-policy=ignore
>> >>> rsc_defaults build-resource-defaults: \
>> >>>         resource-stickiness=1
>> >>> rsc_defaults rsc-options: \
>> >>>         resource-stickiness=100 \
>> >>>         migration-threshold=3 \
>> >>>         failure-timeout=1m \
>> >>>         cluster-recheck-interval=10min
>> >>> op_defaults op-options: \
>> >>>         timeout=600 \
>> >>>         record-pending=true
>> >>>
>> >>> On a 4-node setup, when the whole cluster is brought up together, we
>> >>> see error logs like:
>> >>>
>> >>> 2023-06-26T11:35:17.231104+00:00 FILE-1 pacemaker-schedulerd[26359]:
>> >>> warning: Fencing and resource management disabled due to lack of quorum
>> >>>
>> >>> 2023-06-26T11:35:17.231338+00:00 FILE-1 pacemaker-schedulerd[26359]:
>> >>> warning: Ignoring malformed node_state entry without uname
>> >>>
>> >>> 2023-06-26T11:35:17.233771+00:00 FILE-1 pacemaker-schedulerd[26359]:
>> >>> warning: Node FILE-2 is unclean!
>> >>>
>> >>> 2023-06-26T11:35:17.233857+00:00 FILE-1 pacemaker-schedulerd[26359]:
>> >>> warning: Node FILE-3 is unclean!
>> >>>
>> >>> 2023-06-26T11:35:17.233957+00:00 FILE-1 pacemaker-schedulerd[26359]:
>> >>> warning: Node FILE-4 is unclean!
>> >>>
>> >>
>> >> According to this output FILE-1 lost connection to three other nodes,
>> >> in which case it cannot be quorate.
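>> >>
>> >> The arithmetic, as far as I understand votequorum, is roughly:
>> >>
>> >>     4 nodes x 1 vote -> expected_votes = 4, quorum = 4/2 + 1 = 3
>> >>     1 node visible   -> 1 < 3, so the partition is not quorate
>> >>
>> >> last_man_standing only recalculates expected_votes after
>> >> last_man_standing_window has passed, and only while the remaining
>> >> partition is still quorate, so it cannot help when half of the nodes
>> >> (or more) disappear at once or when a single node starts on its own.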
>> >>
>> >>>
>> >>> Kindly help correct the configuration to make the system function
>> >>> normally with all resources up, even if there is just one node up.
>> >>>
>> >>> Please let me know if any more info is needed.
>> >>>
>> >>> Thanks
>> >>> Priyanka
>> >>>
>> >>>