[ClusterLabs] no-quorum-policy=ignore is deprecated and replaced with other options, but they are not an effective solution
Klaus Wenninger
kwenning at redhat.com
Tue Jun 27 11:35:57 EDT 2023
On Tue, Jun 27, 2023 at 5:24 PM Andrei Borzenkov <arvidjaar at gmail.com>
wrote:
> On 27.06.2023 07:21, Priyanka Balotra wrote:
> > Hi Andrei,
> > After this state the system went through some more fencings and we saw
> the
> > following state:
> >
> > :~ # crm status
> > Cluster Summary:
> > * Stack: corosync
> > * Current DC: FILE-2 (version
> > 2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36) -
> > partition with quorum
>
> It says "partition with quorum" so what exactly is the problem?
>
I guess the problem is that resources aren't being recovered on
the nodes in the quorate partition.
The reason is probably that, as Ken already suggested, fencing isn't
working properly, or the fencing devices used are simply inappropriate
for the purpose (e.g. onboard IPMI).
The fact that a node is rebooting isn't enough: the node that initiated
fencing has to know that the fencing actually worked. But we're just
guessing here. The logs should show what is actually going on.
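One quick way to check that is the fence history the cluster keeps
(option names as in recent Pacemaker 2.1 builds; check --help on your
version):

    # show recorded fencing actions and whether they completed
    stonith_admin --history '*' --verbose

    # or include the fencing section in a one-shot status output
    crm_mon --one-shot --include=fencing

If the reboots you observe never show up there as successful actions,
the cluster won't consider the nodes safely fenced and won't recover
their resources elsewhere.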
Klaus
>
> > * Last updated: Mon Jun 26 12:44:15 2023
> > * Last change: Mon Jun 26 12:41:12 2023 by root via cibadmin on FILE-2
> > * 4 nodes configured
> > * 11 resource instances configured
> >
> > Node List:
> > * Node FILE-1: UNCLEAN (offline)
> > * Node FILE-4: UNCLEAN (offline)
> > * Online: [ FILE-2 ]
> > * Online: [ FILE-3 ]
> >
> > At this stage FILE-1 and FILE-4 were continuously getting fenced (we have
> > device-based stonith configured, but the resource was not up).
> > Two nodes were online and two were offline, so quorum wasn't attained
> > again.
> > 1) For such a scenario we need help to be able to have one cluster live.
> > 2) And in cases where only one node of the cluster is up and the others
> > are down, we need the resources and the cluster to be up.
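Just on the arithmetic of that two-out-of-four scenario: with one vote
per node, expected_votes is 4 and quorum is floor(4/2) + 1 = 3, so a
two-node partition can never be quorate on its own, no matter what
no-quorum-policy is set to on the Pacemaker side:

    4 nodes x 1 vote          -> expected_votes = 4
    quorum = floor(4/2) + 1   -> 3 votes required
    2 nodes online            -> 2 < 3, partition without quorum

As far as I understand votequorum, the option aimed at exactly this
even split is auto_tie_breaker in corosync.conf (see the sketch further
down), which by default lets the partition containing the lowest node id
keep quorum.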
> >
> > Thanks
> > Priyanka
> >
> > On Tue, Jun 27, 2023 at 12:25 AM Andrei Borzenkov <arvidjaar at gmail.com>
> > wrote:
> >
> >> On 26.06.2023 21:14, Priyanka Balotra wrote:
> >>> Hi All,
> >>> We are seeing an issue after we replaced no-quorum-policy=ignore with
> >>> other options in corosync.conf in order to simulate the same behaviour:
> >>>
> >>>     wait_for_all: 0
> >>>     last_man_standing: 1
> >>>     last_man_standing_window: 20000
> >>>
> >>> Another option (auto-tie-breaker) was also tried, but we couldn't
> >>> configure it, as crm did not recognise the property.
> >>>
> >>> But even after using these options, we are seeing that the system is not
> >>> quorate unless at least half of the nodes are up.
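For what it's worth, all of these are votequorum options and belong in
the quorum section of corosync.conf; auto_tie_breaker is a corosync
option as well, not a CIB property, which would explain why crm does not
recognise it. A minimal sketch of that section (values are illustrative
only, adjust to your cluster):

    quorum {
        provider: corosync_votequorum
        wait_for_all: 0
        last_man_standing: 1
        # window is in milliseconds
        last_man_standing_window: 20000
        auto_tie_breaker: 1
    }

The change has to be made on every node, and as far as I know these
votequorum options are only picked up when corosync is (re)started, not
via a runtime reload.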
> >>>
> >>> Some properties from crm config are as follows:
> >>>
> >>>
> >>>
> >>> primitive stonith-sbd stonith:external/sbd \
> >>>     params pcmk_delay_base=5s
> >>> [...]
> >>> property cib-bootstrap-options: \
> >>>     have-watchdog=true \
> >>>     dc-version="2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36" \
> >>>     cluster-infrastructure=corosync \
> >>>     cluster-name=FILE \
> >>>     stonith-enabled=true \
> >>>     stonith-timeout=172 \
> >>>     stonith-action=reboot \
> >>>     stop-all-resources=false \
> >>>     no-quorum-policy=ignore
> >>> rsc_defaults build-resource-defaults: \
> >>>     resource-stickiness=1
> >>> rsc_defaults rsc-options: \
> >>>     resource-stickiness=100 \
> >>>     migration-threshold=3 \
> >>>     failure-timeout=1m \
> >>>     cluster-recheck-interval=10min
> >>> op_defaults op-options: \
> >>>     timeout=600 \
> >>>     record-pending=true
> >>>
> >>> On a 4-node setup when the whole cluster is brought up together we see
> >>> error logs like:
> >>>
> >>> 2023-06-26T11:35:17.231104+00:00 FILE-1 pacemaker-schedulerd[26359]:
> >>> warning: Fencing and resource management disabled due to lack of quorum
> >>>
> >>> 2023-06-26T11:35:17.231338+00:00 FILE-1 pacemaker-schedulerd[26359]:
> >>> warning: Ignoring malformed node_state entry without uname
> >>>
> >>> 2023-06-26T11:35:17.233771+00:00 FILE-1 pacemaker-schedulerd[26359]:
> >>> warning: Node FILE-2 is unclean!
> >>>
> >>> 2023-06-26T11:35:17.233857+00:00 FILE-1 pacemaker-schedulerd[26359]:
> >>> warning: Node FILE-3 is unclean!
> >>>
> >>> 2023-06-26T11:35:17.233957+00:00 FILE-1 pacemaker-schedulerd[26359]:
> >>> warning: Node FILE-4 is unclean!
> >>>
> >>
> >> According to this output, FILE-1 lost connection to the three other
> >> nodes, in which case it cannot be quorate.
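That matches the vote math at startup as well:

    votes visible to FILE-1 = 1 (itself)  <  3 = floor(4/2) + 1 required

As far as I understand votequorum, last_man_standing only lowers
expected_votes step by step while an already-quorate cluster loses nodes
one at a time within the configured window, so it does not help a
partition that never had quorum in the first place.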
> >>
> >>>
> >>> Kindly help correct the configuration to make the system function
> >> normally
> >>> with all resources up, even if there is just one node up.
> >>>
> >>> Please let me know if any more info is needed.
> >>>
> >>> Thanks
> >>> Priyanka
> >>>
> >>>