<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Jun 28, 2023 at 7:38 AM Klaus Wenninger <<a href="mailto:kwenning@redhat.com">kwenning@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Jun 28, 2023 at 3:30 AM Priyanka Balotra <<a href="mailto:priyanka.14balotra@gmail.com" target="_blank">priyanka.14balotra@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto">I am using SLES 15 SP4. Is the no-quorum-policy still supported?</div><div dir="auto"> <br></div></blockquote></div></div></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto"></div><div dir="auto">Thanks</div><div dir="auto">Priyanka</div><div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, 28 Jun 2023 at 12:46 AM, Ken Gaillot <<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Tue, 2023-06-27 at 22:38 +0530, Priyanka Balotra wrote:<br>

> In this case stonith has been configured as a resource, <br>

> primitive stonith-sbd stonith:external/sbd<br></blockquote></div></div></blockquote><div><br></div><div>Then the error scenario you described looks as if every node lost its connection</div><div>to the shared storage. The rebooting nodes then probably self-fenced (suicided)</div><div>rather than reading the poison pill. The quorate partition stays alive because</div><div>it is quorate, but without seeing the shared storage it can't verify that it was</div><div>able to write the poison pill, which makes the other nodes stay unclean.</div><div>But again, I'm just guessing ...</div></div></div></blockquote><div><br></div><div>That said, and without knowing the details of your scenario and the</div><div>failure scenarios you want to cover, you might consider watchdog fencing.</div><div>AFAIK SUSE has supported that as well for a while now.</div><div>It gives you service recovery from nodes that are cut off from the network,</div><div>including from their physical fencing devices. I know that poison-pill fencing</div><div>should do that as well, as long as the quorate part of the cluster is able</div><div>to access the shared disk, but in your scenario this doesn't seem to be</div><div>the case.</div><div>Just out of curiosity: are you using poison-pill with multiple shared disks?</div><div>I ask because in that case the poison pill may still be passed via a single disk</div><div>and the target would reboot, but the side that initiated fencing might</div><div>not recover resources, as it might not have been able to write the poison pill</div><div>to a quorate number of disks.</div><div><br></div><div>Klaus</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

> <br>

> For it to function properly, the resource needs to be up, which<br>

> is only possible if the system is quorate.<br>

<br>

Pacemaker can use a fence device even if its resource is not active.<br>

The resource being active just allows Pacemaker to monitor the device<br>

regularly.<br>

<br>

> <br>

> Hence our requirement is to make the system quorate even if only one node<br>

> of the cluster is up.<br>

> Stonith will then take care of any split-brain scenarios. <br>

<br>

In that case it sounds like no-quorum-policy=ignore is actually what<br>

you want.<br></blockquote></div></div></blockquote><div><br></div><div>Still dangerous without something like wait-for-all, right?</div><div>With LMS (last_man_standing), though, I guess you should get the same effect</div><div>without having specified it explicitly.</div><div><br></div><div>Klaus</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

> <br>

> Thanks<br>

> Priyanka<br>

> <br>

> On Tue, Jun 27, 2023 at 9:06 PM Klaus Wenninger <<a href="mailto:kwenning@redhat.com" target="_blank">kwenning@redhat.com</a>><br>

> wrote:<br>

> > <br>

> > On Tue, Jun 27, 2023 at 5:24 PM Andrei Borzenkov <<br>

> > <a href="mailto:arvidjaar@gmail.com" target="_blank">arvidjaar@gmail.com</a>> wrote:<br>

> > > On 27.06.2023 07:21, Priyanka Balotra wrote:<br>

> > > > Hi Andrei,<br>

> > > > > After this state the system went through some more fencing events and we<br>
> > > > > saw the following state:<br>

> > > > <br>

> > > > :~ # crm status<br>

> > > > Cluster Summary:<br>

> > > >    * Stack: corosync<br>

> > > > >    * Current DC: FILE-2 (version 2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36) - partition with quorum<br>

> > > <br>

> > > It says "partition with quorum" so what exactly is the problem?<br>

> > <br>

> > I guess the problem is that resources aren't being recovered on<br>
> > the nodes in the quorate partition.<br>
> > The reason for that is probably that, as Ken already suggested,<br>
> > fencing isn't working properly, or the fencing devices used are simply<br>
> > inappropriate for the purpose (e.g. onboard IPMI).<br>
> > The fact that a node is rebooting isn't enough. The node that initiated<br>
> > fencing has to know that it actually worked. But we're just guessing<br>
> > here. Logs should show what is actually going on.<br>

> > <br>

> > Klaus<br>

> > > >    * Last updated: Mon Jun 26 12:44:15 2023<br>

> > > > >    * Last change:  Mon Jun 26 12:41:12 2023 by root via cibadmin on FILE-2<br>

> > > >    * 4 nodes configured<br>

> > > >    * 11 resource instances configured<br>

> > > > <br>

> > > > Node List:<br>

> > > >    * Node FILE-1: UNCLEAN (offline)<br>

> > > >    * Node FILE-4: UNCLEAN (offline)<br>

> > > >    * Online: [ FILE-2 ]<br>

> > > >    * Online: [ FILE-3 ]<br>

> > > > <br>

> > > > > At this stage FILE-1 and FILE-4 were continuously getting fenced (we have<br>
> > > > > device-based stonith configured, but the resource was not up).<br>
> > > > > Two nodes were online and two were offline, so quorum wasn't attained<br>
> > > > > again.<br>
> > > > > 1)  For such a scenario we need help to be able to keep one cluster live.<br>
> > > > > 2)  And in cases where only one node of the cluster is up and the others are<br>
> > > > > down, we need the resources and the cluster to be up.<br>

> > > > <br>

> > > > Thanks<br>

> > > > Priyanka<br>

> > > > <br>

> > > > On Tue, Jun 27, 2023 at 12:25 AM Andrei Borzenkov <<br>

> > > <a href="mailto:arvidjaar@gmail.com" target="_blank">arvidjaar@gmail.com</a>><br>

> > > > wrote:<br>

> > > > <br>

> > > >> On 26.06.2023 21:14, Priyanka Balotra wrote:<br>

> > > >>> Hi All,<br>

> > > > >>> We are seeing an issue where we replaced no-quorum-policy=ignore with other<br>
> > > > >>> options in corosync.conf in order to simulate the same behaviour:<br>

> > > >>><br>

> > > >>><br>

> > > > >>> wait_for_all: 0<br>
> > > > >>> last_man_standing: 1<br>
> > > > >>> last_man_standing_window: 20000<br>

> > > >>><br>

> > > > >>> We also tried another property (auto-tie-breaker) but couldn't configure<br>
> > > > >>> it, as crm did not recognise this property.<br>

> > > >>><br>

> > > > >>> But even after using these options, we are seeing that the system is not<br>
> > > > >>> quorate unless at least half of the nodes are up.<br>

> > > >>><br>

> > > >>> Some properties from crm config are as follows:<br>

> > > >>><br>

> > > >>><br>

> > > >>><br>

> > > > >>> primitive stonith-sbd stonith:external/sbd \<br>
> > > > >>>         params pcmk_delay_base=5s<br>
> > > > >>><br>
> > > > >>> property cib-bootstrap-options: \<br>
> > > > >>>         have-watchdog=true \<br>
> > > > >>>         dc-version="2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36" \<br>
> > > > >>>         cluster-infrastructure=corosync \<br>
> > > > >>>         cluster-name=FILE \<br>
> > > > >>>         stonith-enabled=true \<br>
> > > > >>>         stonith-timeout=172 \<br>
> > > > >>>         stonith-action=reboot \<br>
> > > > >>>         stop-all-resources=false \<br>
> > > > >>>         no-quorum-policy=ignore<br>
> > > > >>> rsc_defaults build-resource-defaults: \<br>
> > > > >>>         resource-stickiness=1<br>
> > > > >>> rsc_defaults rsc-options: \<br>
> > > > >>>         resource-stickiness=100 \<br>
> > > > >>>         migration-threshold=3 \<br>
> > > > >>>         failure-timeout=1m \<br>
> > > > >>>         cluster-recheck-interval=10min<br>
> > > > >>> op_defaults op-options: \<br>
> > > > >>>         timeout=600 \<br>
> > > > >>>         record-pending=true<br>

> > > >>><br>

> > > >>> On a 4-node setup when the whole cluster is brought up<br>

> > > together we see<br>

> > > >>> error logs like:<br>

> > > >>><br>

> > > > >>> 2023-06-26T11:35:17.231104+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Fencing and resource management disabled due to lack of quorum<br>
> > > > >>> 2023-06-26T11:35:17.231338+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Ignoring malformed node_state entry without uname<br>
> > > > >>> 2023-06-26T11:35:17.233771+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Node FILE-2 is unclean!<br>
> > > > >>> 2023-06-26T11:35:17.233857+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Node FILE-3 is unclean!<br>
> > > > >>> 2023-06-26T11:35:17.233957+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Node FILE-4 is unclean!<br>

> > > >>><br>

> > > >><br>

> > > > >> According to this output FILE-1 lost connection to three other nodes, in<br>
> > > > >> which case it cannot be quorate.<br>
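<br>
To make the arithmetic behind that remark concrete, here is a minimal sketch of the plain majority rule, assuming one vote per node and ignoring votequorum options such as wait_for_all, last_man_standing and auto_tie_breaker. The function names are hypothetical and are not part of corosync or Pacemaker.<br>
<pre>
```python
def quorum_threshold(expected_votes: int) -> int:
    """Smallest vote count that is a strict majority of expected_votes."""
    return expected_votes // 2 + 1


def is_quorate(visible_votes: int, expected_votes: int) -> bool:
    """True when the partition holds a strict majority of the expected votes."""
    return visible_votes >= quorum_threshold(expected_votes)


if __name__ == "__main__":
    # Assumption: four nodes with one vote each, as in the cluster in this thread.
    expected = 4
    for visible in range(1, expected + 1):
        print(visible, "of", expected, "votes: quorate =", is_quorate(visible, expected))
```
</pre>
Under this simplified rule a four-node cluster needs three votes, so a node that sees only itself, or an even two/two split, is not quorate.<br>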

> > > >><br>

> > > >>><br>

> > > > >>> Kindly help correct the configuration to make the system function normally<br>
> > > > >>> with all resources up, even if there is just one node up.<br>

> > > >>><br>

> > > >>> Please let me know if any more info is needed.<br>

> > > >>><br>

> > > >>> Thanks<br>

> > > >>> Priyanka<br>

-- <br>

Ken Gaillot <<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>><br>

<br>

_______________________________________________<br>

Manage your subscription:<br>

<a href="https://lists.clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">https://lists.clusterlabs.org/mailman/listinfo/users</a><br>

<br>

ClusterLabs home: <a href="https://www.clusterlabs.org/" rel="noreferrer" target="_blank">https://www.clusterlabs.org/</a><br>

</blockquote></div></div>

</blockquote></div></div>

</blockquote></div></div>