[ClusterLabs] no-quorum-policy=ignore is deprecated and replaced with other options, but they are not an effective solution

Ken Gaillot kgaillot at redhat.com
Tue Jun 27 15:16:00 EDT 2023


On Tue, 2023-06-27 at 22:38 +0530, Priyanka Balotra wrote:
> In this case stonith has been configured as a resource, 
> primitive stonith-sbd stonith:external/sbd
> 
> For it to function properly, the resource needs to be up, which is
> only possible if the system is quorate.

Pacemaker can use a fence device even if its resource is not active.
The resource being active just allows Pacemaker to monitor the device
regularly.
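
(If you want to verify that, the fencer's view of the device is visible
even while the resource is stopped, e.g.:

  # stonith_admin --list-registered
  # stonith_admin --query stonith-sbd

using the device name from your configuration.)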

> 
> Hence our requirement is to make the system quorate even if only one
> node of the cluster is up.
> Stonith will then take care of any split-brain scenarios. 

In that case it sounds like no-quorum-policy=ignore is actually what
you want.
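
With the crm shell, that would be something like:

  # crm configure property no-quorum-policy=ignore

Just make sure fencing actually works first, since with this setting a
partition that has lost quorum will still fence the nodes it cannot see
and take over their resources.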

> 
> Thanks
> Priyanka
> 
> On Tue, Jun 27, 2023 at 9:06 PM Klaus Wenninger <kwenning at redhat.com> wrote:
> > 
> > On Tue, Jun 27, 2023 at 5:24 PM Andrei Borzenkov <arvidjaar at gmail.com> wrote:
> > > On 27.06.2023 07:21, Priyanka Balotra wrote:
> > > > Hi Andrei,
> > > > After this state, the system went through some more fencings and
> > > > we saw the following state:
> > > > 
> > > > :~ # crm status
> > > > Cluster Summary:
> > > >    * Stack: corosync
> > > >    * Current DC: FILE-2 (version 2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36) - partition with quorum
> > > 
> > > It says "partition with quorum" so what exactly is the problem?
> > 
> > I guess the problem is that resources aren't being recovered on
> > the nodes in the quorate partition.
> > The reason for that is probably, as Ken was already suggesting, that
> > fencing isn't working properly, or that the fencing devices used are
> > simply inappropriate for the purpose (e.g. onboard IPMI, which can't
> > confirm a fence if the host loses power entirely).
> > The fact that a node is rebooting isn't enough. The node that
> > initiated fencing
> > has to know that it did actually work. But we're just guessing
> > here. Logs should
> > show what is actually going on.
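> > 
> > For example, newer Pacemaker versions can show the fence history and
> > whether each action was actually confirmed, something like:
> > 
> >   # stonith_admin --history '*' --verbose
> >   # crm_mon --one-shot --fence-history=3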
> > 
> > Klaus
> > > >    * Last updated: Mon Jun 26 12:44:15 2023
> > > >    * Last change:  Mon Jun 26 12:41:12 2023 by root via cibadmin on FILE-2
> > > >    * 4 nodes configured
> > > >    * 11 resource instances configured
> > > > 
> > > > Node List:
> > > >    * Node FILE-1: UNCLEAN (offline)
> > > >    * Node FILE-4: UNCLEAN (offline)
> > > >    * Online: [ FILE-2 ]
> > > >    * Online: [ FILE-3 ]
> > > > 
> > > > At this stage FILE-1 and FILE-4 were continuously getting fenced
> > > > (we have device-based stonith configured, but the resource was not
> > > > up). Two nodes were online and two were offline, so quorum wasn't
> > > > attained again.
> > > > 1)  For such a scenario we need help keeping the cluster alive.
> > > > 2)  And in cases where only one node of the cluster is up and the
> > > > others are down, we need the resources and the cluster to be up.
> > > > 
> > > > Thanks
> > > > Priyanka
> > > > 
> > > > On Tue, Jun 27, 2023 at 12:25 AM Andrei Borzenkov <arvidjaar at gmail.com> wrote:
> > > > 
> > > >> On 26.06.2023 21:14, Priyanka Balotra wrote:
> > > >>> Hi All,
> > > >>> We are seeing an issue where we replaced no-quorum-policy=ignore
> > > >>> with other options in corosync.conf in order to simulate the same
> > > >>> behaviour:
> > > >>>
> > > >>>     wait_for_all: 0
> > > >>>     last_man_standing: 1
> > > >>>     last_man_standing_window: 20000
> > > >>>
> > > >>> We also tried another property (auto-tie-breaker), but couldn't
> > > >>> configure it, as crm did not recognise this property.
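> > > >>>
> > > >>> (Presumably auto_tie_breaker is a corosync votequorum option
> > > >>> rather than a Pacemaker property, which would explain why crm did
> > > >>> not recognise it. A sketch of where these settings live in
> > > >>> corosync.conf, assuming corosync_votequorum:
> > > >>>
> > > >>>     quorum {
> > > >>>         provider: corosync_votequorum
> > > >>>         wait_for_all: 0
> > > >>>         last_man_standing: 1
> > > >>>         last_man_standing_window: 20000
> > > >>>         auto_tie_breaker: 1
> > > >>>     }
> > > >>>
> > > >>> See votequorum(5) for whether auto_tie_breaker can be combined
> > > >>> with last_man_standing.)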
> > > >>>
> > > >>> But even after using these options, we are seeing that the
> > > >>> system is not quorate if at least half of the nodes are not up.
> > > >>>
> > > >>> Some properties from crm config are as follows:
> > > >>>
> > > >>> primitive stonith-sbd stonith:external/sbd \
> > > >>>     params pcmk_delay_base=5s
> > > >>>
> > > >>> property cib-bootstrap-options: \
> > > >>>     have-watchdog=true \
> > > >>>     dc-version="2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36" \
> > > >>>     cluster-infrastructure=corosync \
> > > >>>     cluster-name=FILE \
> > > >>>     stonith-enabled=true \
> > > >>>     stonith-timeout=172 \
> > > >>>     stonith-action=reboot \
> > > >>>     stop-all-resources=false \
> > > >>>     no-quorum-policy=ignore
> > > >>> rsc_defaults build-resource-defaults: \
> > > >>>     resource-stickiness=1
> > > >>> rsc_defaults rsc-options: \
> > > >>>     resource-stickiness=100 \
> > > >>>     migration-threshold=3 \
> > > >>>     failure-timeout=1m \
> > > >>>     cluster-recheck-interval=10min
> > > >>> op_defaults op-options: \
> > > >>>     timeout=600 \
> > > >>>     record-pending=true
> > > >>>
> > > >>> On a 4-node setup, when the whole cluster is brought up
> > > >>> together, we see error logs like:
> > > >>>
> > > >>> 2023-06-26T11:35:17.231104+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Fencing and resource management disabled due to lack of quorum
> > > >>> 2023-06-26T11:35:17.231338+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Ignoring malformed node_state entry without uname
> > > >>> 2023-06-26T11:35:17.233771+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Node FILE-2 is unclean!
> > > >>> 2023-06-26T11:35:17.233857+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Node FILE-3 is unclean!
> > > >>> 2023-06-26T11:35:17.233957+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Node FILE-4 is unclean!
> > > >>>
> > > >>
> > > >> According to this output, FILE-1 lost connection to the three
> > > >> other nodes, in which case it cannot be quorate.
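> > > >>
> > > >> (With four configured votes, majority quorum is floor(4/2)+1 = 3.
> > > >> last_man_standing can only lower expected_votes step by step while
> > > >> the cluster remains quorate; it cannot help a node that comes up
> > > >> alone and never attains quorum in the first place.)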
> > > >>
> > > >>>
> > > >>> Kindly help correct the configuration to make the system
> > > >>> function normally, with all resources up, even if there is just
> > > >>> one node up.
> > > >>>
> > > >>> Please let me know if any more info is needed.
> > > >>>
> > > >>> Thanks
> > > >>> Priyanka
-- 
Ken Gaillot <kgaillot at redhat.com>


