[ClusterLabs] no-quorum-policy=ignore is (Deprecated ) and replaced with other options but not an effective solution
Ken Gaillot
kgaillot at redhat.com
Tue Jun 27 10:30:24 EDT 2023
On Tue, 2023-06-27 at 09:51 +0530, Priyanka Balotra wrote:
> Hi Andrei,
> After this state the system went through some more fencings and we
> saw the following state:
>
> :~ # crm status
> Cluster Summary:
> * Stack: corosync
> * Current DC: FILE-2 (version 2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36) - partition with quorum
> * Last updated: Mon Jun 26 12:44:15 2023
> * Last change: Mon Jun 26 12:41:12 2023 by root via cibadmin on FILE-2
> * 4 nodes configured
> * 11 resource instances configured
>
> Node List:
> * Node FILE-1: UNCLEAN (offline)
> * Node FILE-4: UNCLEAN (offline)
> * Online: [ FILE-2 ]
> * Online: [ FILE-3 ]
>
> At this stage FILE-1 and FILE-4 were continuously getting fenced (we
> have device-based stonith configured, but the resource was not up).
> Two nodes were online and two were offline, so quorum wasn't attained
> again.
> 1) For such a scenario, we need help keeping one cluster live.
> 2) In cases where only one node of the cluster is up and the others
> are down, we need the resources and the cluster to stay up.
The solution is to fix the fencing.
Without fencing, there is no way to know that the other nodes are
*actually* offline. It's possible that communication between the nodes
has been temporarily interrupted, in which case recovering resources
could lead to a "split-brain" situation that could corrupt data or make
services unusable.
Onboard IPMI is not a production fencing mechanism by itself, because
it loses power when the node loses power. It's fine to use in a
topology with a fallback method such as power fencing or watchdog-based
SBD.
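
To illustrate the quorum side of the question, here is a minimal Python
sketch of the majority arithmetic. It assumes the default votequorum
rule (one vote per node, quorum is a strict majority of the expected
votes) and deliberately ignores options such as last_man_standing, which
can lower the expected votes over time. The function names are made up
for illustration and are not part of corosync or Pacemaker.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for a strict majority of expected_votes."""
    return expected_votes // 2 + 1


def is_quorate(online_votes: int, expected_votes: int) -> bool:
    """True if the partition holds at least a strict majority of votes."""
    return online_votes >= quorum_threshold(expected_votes)


if __name__ == "__main__":
    expected = 4  # assumption: four configured nodes, one vote each
    for online in range(1, expected + 1):
        state = "quorate" if is_quorate(online, expected) else "not quorate"
        print(f"{online} of {expected} votes: {state}")
```

Under those assumptions a four-node cluster needs three votes, so a
node that can see only itself, as FILE-1 does in the logs further down
the thread, cannot be quorate.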
> Thanks
> Priyanka
>
> On Tue, Jun 27, 2023 at 12:25 AM Andrei Borzenkov <
> arvidjaar at gmail.com> wrote:
> > On 26.06.2023 21:14, Priyanka Balotra wrote:
> > > Hi All,
> > > We are seeing an issue where we replaced no-quorum-policy=ignore with
> > > other options in corosync.conf in order to simulate the same behaviour:
> > >
> > > wait_for_all: 0
> > > last_man_standing: 1
> > > last_man_standing_window: 20000
> > >
> > > We also tried another property (auto-tie-breaker), but could not
> > > configure it because crm did not recognise it.
> > >
> > > But even with these options, the system is not quorate when half or
> > > more of the nodes are down.
> > >
> > > Some properties from crm config are as follows:
> > >
> > >
> > >
> > > primitive stonith-sbd stonith:external/sbd \
> > >         params pcmk_delay_base=5s
> > >
> > > property cib-bootstrap-options: \
> > >         have-watchdog=true \
> > >         dc-version="2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36" \
> > >         cluster-infrastructure=corosync \
> > >         cluster-name=FILE \
> > >         stonith-enabled=true \
> > >         stonith-timeout=172 \
> > >         stonith-action=reboot \
> > >         stop-all-resources=false \
> > >         no-quorum-policy=ignore
> > > rsc_defaults build-resource-defaults: \
> > >         resource-stickiness=1
> > > rsc_defaults rsc-options: \
> > >         resource-stickiness=100 \
> > >         migration-threshold=3 \
> > >         failure-timeout=1m \
> > >         cluster-recheck-interval=10min
> > > op_defaults op-options: \
> > >         timeout=600 \
> > >         record-pending=true
> > >
> > > On a 4-node setup, when the whole cluster is brought up together, we
> > > see error logs like:
> > >
> > > 2023-06-26T11:35:17.231104+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Fencing and resource management disabled due to lack of quorum
> > > 2023-06-26T11:35:17.231338+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Ignoring malformed node_state entry without uname
> > > 2023-06-26T11:35:17.233771+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Node FILE-2 is unclean!
> > > 2023-06-26T11:35:17.233857+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Node FILE-3 is unclean!
> > > 2023-06-26T11:35:17.233957+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Node FILE-4 is unclean!
> > >
> >
> > According to this output, FILE-1 lost connection to the three other
> > nodes, in which case it cannot be quorate.
> >
> > >
> > > Kindly help correct the configuration to make the system function
> > > normally with all resources up, even if there is just one node up.
> > >
> > > Please let me know if any more info is needed.
> > >
> > > Thanks
> > > Priyanka
> > >
> > >
> > > _______________________________________________
> > > Manage your subscription:
> > > https://lists.clusterlabs.org/mailman/listinfo/users
> > >
> > > ClusterLabs home: https://www.clusterlabs.org/
--
Ken Gaillot <kgaillot at redhat.com>