<div dir="ltr">In this case stonith has been configured as a resource, <div><b>primitive stonith-sbd stonith:external/sbd</b><br></div><div><b><br></b></div><div>For it to be functional properly , the resource needs to be up, which is only possible if the system is quorate. </div><div>Hence our requirement is to make the system quorate even if one Node of the cluster is up.</div><div>Stonith will then take care of any split-brain scenarios. </div><div><br></div><div>Thanks</div><div>Priyanka</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Jun 27, 2023 at 9:06 PM Klaus Wenninger <<a href="mailto:kwenning@redhat.com">kwenning@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Jun 27, 2023 at 5:24 PM Andrei Borzenkov <<a href="mailto:arvidjaar@gmail.com" target="_blank">arvidjaar@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 27.06.2023 07:21, Priyanka Balotra wrote:<br>
> Hi Andrei,<br>
> After this state the system went through some more fencing events, and we<br>
> saw the following state:<br>
> <br>
> :~ # crm status<br>
> Cluster Summary:<br>
> * Stack: corosync<br>
> * Current DC: FILE-2 (version<br>
> 2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36) - partition<br>
> with quorum<br>
<br>
It says "partition with quorum" so what exactly is the problem?<br></blockquote><div><br></div><div>I guess the problem is that resources aren't being recovered on</div><div>the nodes in the quorate partition.</div><div>Reason for that is probably that - as Ken was already suggesting - fencing isn't</div><div>working properly or fencing-devices used are simply inappropriate for the </div><div>purpose (e.g. onboard IPMI).</div><div>The fact that a node is rebooting isn't enough. The node that initiated fencing</div><div>has to know that it did actually work. But we're just guessing here. Logs should</div><div>show what is actually going on.</div><div><br></div><div>Klaus</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
> * Last updated: Mon Jun 26 12:44:15 2023<br>
> * Last change: Mon Jun 26 12:41:12 2023 by root via cibadmin on FILE-2<br>
> * 4 nodes configured<br>
> * 11 resource instances configured<br>
> <br>
> Node List:<br>
> * Node FILE-1: UNCLEAN (offline)<br>
> * Node FILE-4: UNCLEAN (offline)<br>
> * Online: [ FILE-2 ]<br>
> * Online: [ FILE-3 ]<br>
> <br>
> At this stage FILE-1 and FILE-4 were continuously getting fenced (we have<br>
> device-based stonith configured, but the resource was not up).<br>
> Two nodes were online and two were offline, so quorum wasn't attained again.<br>
> 1) For such a scenario we need help to be able to keep at least one cluster<br>
> partition live.<br>
> 2) And in cases where only one node of the cluster is up and the others are<br>
> down, we need the resources and the cluster to be up.<br>
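><br>
> Since the two nodes kept being fenced while the stonith-sbd resource was down,<br>
> it is worth confirming from a surviving node whether those fence actions were<br>
> actually reported back as completed. A minimal sketch, assuming the Pacemaker<br>
> 2.1 command-line tools shipped with this cluster:<br>
><br>
> # full fencing history for all targets, with timestamps and results<br>
> stonith_admin --history '*'<br>
> # one-shot cluster status including fencing failures and successes<br>
> crm_mon -1 --fence-history=2<br>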
> <br>
> Thanks<br>
> Priyanka<br>
> <br>
> On Tue, Jun 27, 2023 at 12:25 AM Andrei Borzenkov <<a href="mailto:arvidjaar@gmail.com" target="_blank">arvidjaar@gmail.com</a>><br>
> wrote:<br>
> <br>
>> On 26.06.2023 21:14, Priyanka Balotra wrote:<br>
>>> Hi All,<br>
>>> We are seeing an issue where we replaced no-quorum-policy=ignore with other<br>
>>> options in corosync.conf in order to simulate the same behaviour:<br>
>>><br>
>>><br>
>>> wait_for_all: 0<br>
>>> last_man_standing: 1<br>
>>> last_man_standing_window: 20000<br>
>>><br>
>>> There was another option (auto-tie-breaker) that we tried, but we couldn't<br>
>>> configure it, as crm did not recognise this property.<br>
>>><br>
>>> But even after using these options, we are seeing that the system is not<br>
>>> quorate if at least half of the nodes are not up.<br>
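>>><br>
>>> For reference, wait_for_all, last_man_standing(_window) and auto_tie_breaker<br>
>>> are corosync votequorum options rather than Pacemaker cluster properties,<br>
>>> which is presumably why crm did not recognise auto-tie-breaker. They belong<br>
>>> in the quorum section of /etc/corosync/corosync.conf on every node. A minimal<br>
>>> sketch, with illustrative values only (not a recommendation):<br>
>>><br>
>>> quorum {<br>
>>>     provider: corosync_votequorum<br>
>>>     expected_votes: 4<br>
>>>     wait_for_all: 0<br>
>>>     last_man_standing: 1<br>
>>>     last_man_standing_window: 20000<br>
>>>     auto_tie_breaker: 1<br>
>>> }<br>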
>>><br>
>>> Some properties from crm config are as follows:<br>
>>><br>
>>><br>
>>><br>
>>> primitive stonith-sbd stonith:external/sbd \<br>
>>>         params pcmk_delay_base=5s<br>
>>><br>
>>> ...<br>
>>><br>
>>> property cib-bootstrap-options: \<br>
>>>         have-watchdog=true \<br>
>>>         dc-version="2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36" \<br>
>>>         cluster-infrastructure=corosync \<br>
>>>         cluster-name=FILE \<br>
>>>         stonith-enabled=true \<br>
>>>         stonith-timeout=172 \<br>
>>>         stonith-action=reboot \<br>
>>>         stop-all-resources=false \<br>
>>>         no-quorum-policy=ignore<br>
>>> rsc_defaults build-resource-defaults: \<br>
>>>         resource-stickiness=1<br>
>>> rsc_defaults rsc-options: \<br>
>>>         resource-stickiness=100 \<br>
>>>         migration-threshold=3 \<br>
>>>         failure-timeout=1m \<br>
>>>         cluster-recheck-interval=10min<br>
>>> op_defaults op-options: \<br>
>>>         timeout=600 \<br>
>>>         record-pending=true<br>
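>>><br>
>>> Given have-watchdog=true and the stonith:external/sbd agent, it may also be<br>
>>> worth checking that the watchdog and the shared SBD device are healthy on<br>
>>> every node while debugging the repeated fencing. A minimal sketch; the device<br>
>>> path below is a placeholder, the real one is in SBD_DEVICE in /etc/sysconfig/sbd:<br>
>>><br>
>>> # confirm a usable watchdog device is present<br>
>>> sbd query-watchdog<br>
>>> # show the SBD header (timeouts) on the shared device<br>
>>> sbd -d /dev/disk/by-id/SHARED-DISK dump<br>
>>> # show per-node slots and any pending fence messages<br>
>>> sbd -d /dev/disk/by-id/SHARED-DISK list<br>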
>>><br>
>>> On a 4-node setup, when the whole cluster is brought up together, we see<br>
>>> error logs like:<br>
>>><br>
>>> 2023-06-26T11:35:17.231104+00:00 FILE-1 pacemaker-schedulerd[26359]:<br>
>>> warning: Fencing and resource management disabled due to lack of quorum<br>
>>><br>
>>> 2023-06-26T11:35:17.231338+00:00 FILE-1 pacemaker-schedulerd[26359]:<br>
>>> warning: Ignoring malformed node_state entry without uname<br>
>>><br>
>>> 2023-06-26T11:35:17.233771+00:00 FILE-1 pacemaker-schedulerd[26359]:<br>
>>> warning: Node FILE-2 is unclean!<br>
>>><br>
>>> 2023-06-26T11:35:17.233857+00:00 FILE-1 pacemaker-schedulerd[26359]:<br>
>>> warning: Node FILE-3 is unclean!<br>
>>><br>
>>> 2023-06-26T11:35:17.233957+00:00 FILE-1 pacemaker-schedulerd[26359]:<br>
>>> warning: Node FILE-4 is unclean!<br>
>>><br>
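>>> When those warnings appear, it may also help to check what votequorum itself<br>
>>> reports on the affected node, e.g. with the standard corosync tool:<br>
>>><br>
>>> corosync-quorumtool -s    # expected votes, total votes, quorum, Quorate flag<br>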
>><br>
>> According to this output FILE-1 lost connection to three other nodes, in<br>
>> which case it cannot be quorate.<br>
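>><br>
>> As a rough worked example for this 4-node cluster (one vote per node):<br>
>> expected_votes = 4, so the quorum threshold is 4/2 + 1 = 3 votes. A lone node<br>
>> only ever sees its own single vote, so it can never reach 3. last_man_standing<br>
>> can lower expected_votes, but only step by step from a partition that is still<br>
>> quorate, and only after last_man_standing_window (20000 ms here) has elapsed;<br>
>> it cannot make an already isolated node quorate.<br>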
>><br>
>>><br>
>>> Kindly help correct the configuration to make the system function normally,<br>
>>> with all resources up, even if there is just one node up.<br>
>>><br>
>>> Please let me know if any more info is needed.<br>
>>><br>
>>> Thanks<br>
>>> Priyanka<br>
>>><br>
>><br>
> <br>
<br>
_______________________________________________<br>
Manage your subscription:<br>
<a href="https://lists.clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">https://lists.clusterlabs.org/mailman/listinfo/users</a><br>
<br>
ClusterLabs home: <a href="https://www.clusterlabs.org/" rel="noreferrer" target="_blank">https://www.clusterlabs.org/</a><br>
</blockquote></div></div>
</blockquote></div>