<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Oct 10, 2024 at 3:58 PM Angelo Ruggiero via Users <<a href="mailto:users@clusterlabs.org">users@clusterlabs.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="msg5594183960408677331">
<div dir="ltr">
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:11pt;color:rgb(0,0,0)">
Thanks for answering. It helps.</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:11pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:11pt;color:rgb(0,0,0)">
>Main scenario where poison pill shines is 2-node-clusters where you don't<br>
>have usable quorum for watchdog-fencing.</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:11pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:11pt;color:rgb(0,0,0)">
Not sure i understand. As if just 2 node and one node fails it cannot respond to the poision pilll. Maybe i mis your point.</div></div></div></blockquote><div><br></div><div>If in a 2 node setup one node loses contact to the other or sees some other reason why it would like</div><div>the partner-node to be fenced it will try to write the poison-pill message to the shared disk and if that</div><div>goes Ok and after a configured wait time for the other node to read the message, respond or the</div><div>watchdog to kick in it will assume the other node to be fenced. </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="msg5594183960408677331"><div dir="ltr">
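If you want to see those mechanics first-hand, the sbd binary lets you poke
at the disk directly - a rough sketch, assuming your shared disk sits at
/dev/disk/by-id/my-sbd-disk (that path is made up, substitute your own):

    # initialize the per-node message slots on the shared disk
    # (wipes any existing sbd metadata on it!)
    sbd -d /dev/disk/by-id/my-sbd-disk create

    # show the slots and any message currently sitting in them
    sbd -d /dev/disk/by-id/my-sbd-disk list

    # dump the on-disk header - msgwait is the configured wait time
    # mentioned above
    sbd -d /dev/disk/by-id/my-sbd-disk dump

    # manually write a poison pill for node2; its sbd daemon will
    # self-fence as soon as it reads the message
    sbd -d /dev/disk/by-id/my-sbd-disk message node2 reset

In a pacemaker cluster the fence_sbd stonith agent does that last step for
you, of course.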
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:11pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:11pt;color:rgb(0,0,0)">
This also begs the followup question, what defines "usable quroum". Do you mean for example on seperate independent network hardware and power supply?</div></div></div></blockquote><div><br></div><div>Quorum in 2 node clusters is a bit different as they will stay quorate when losing connection. To prevent split brain there if they</div><div>reboot on top they will just regain quorum once they've seen each other (search for 'wait-for-all' to read more).</div><div>This behavior is of course not usable for watchdog-fencing and thus SBD automatically switches to not relying on quorum in</div><div>those 2-node setups.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="msg5594183960408677331"><div dir="ltr">
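For reference, those knobs live in the quorum section of corosync.conf - a
minimal sketch of what a 2-node setup typically ends up with (verify
against what pcs generated for you):

    # /etc/corosync/corosync.conf (excerpt)
    quorum {
        provider: corosync_votequorum
        two_node: 1       # stay quorate when the peer is lost
        wait_for_all: 1   # after a reboot, wait until all nodes have
                          # been seen once before granting quorum
    }

Note that two_node: 1 implicitly enables wait_for_all, which is exactly
the regain-quorum-only-after-seeing-each-other behavior described above.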
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:11pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:11pt;color:rgb(0,0,0)">
>Configured with pacemaker-awareness - default - availability of the shared-disk doesn't become an issue as, due to fallback to availability of the 2nd node, the disk is >no spof (single point of failure) in these clusters.</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:11pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:11pt;color:rgb(0,0,0)">
I did not get the jist of what you are trying to say here. 🙂<br>
<br></div></div></div></blockquote><div><br></div><div>I was suggesting a scenario that has 2 cluster nodes + a single shared disk. With kind of 'pure' SBD this would mean that a node</div><div>that is losing connection to the disk would have to self fence which would mean that this disk would become a so called</div><div>single-point-of-failure - meaning that available of resources in the cluster would be reduced to availability of this single disk.</div><div>So I tried to explain why you don't have to fear this reduction of availability using pacemaker-awareness.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="msg5594183960408677331"><div dir="ltr"><div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:11pt;color:rgb(0,0,0)">
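The switch for this behavior sits in /etc/sysconfig/sbd - a sketch of the
relevant lines (device path again made up):

    SBD_DEVICE=/dev/disk/by-id/my-sbd-disk
    SBD_PACEMAKER=yes       # default: as long as pacemaker still sees the
                            # node as healthy and quorate, losing the disk
                            # alone does not trigger self-fencing
    SBD_WATCHDOG_DEV=/dev/watchdog
    SBD_WATCHDOG_TIMEOUT=5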
> > Other nodes btw. can still kill a node with watchdog-fencing.
>
> How does that work? When would the killing node tell the other node not
> to keep triggering its watchdog?
> Having written the above sentence, maybe I should go and read up on when
> the poison pill gets sent by the killing node!

It would either use cluster communication to tell the node to self-fence,
or, if that isn't available, the case quoted below kicks in.

Hope that makes things a bit clearer.

Regards,
Klaus
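P.S. In case you want to experiment with the watchdog-fencing side of
this: it is enabled via a cluster property - a sketch assuming pcs and the
common rule of thumb of roughly twice SBD_WATCHDOG_TIMEOUT:

    pcs property set stonith-watchdog-timeout=10

With that set, the surviving nodes assume an unseen node has self-fenced
once that time has passed.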
> > If the node isn't able to accept that wish of another node for it to
> > die, it will have lost quorum and have stopped triggering the watchdog
> > anyway.
>
> Yes, that is clear to me - the self-fencing is quite powerful.
>
> Thanks for the response.
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:11pt;color:rgb(0,0,0)">
<br>
</div>
<div id="m_5594183960408677331appendonsend"></div>
<hr style="display:inline-block;width:98%">
<div id="m_5594183960408677331divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Users <<a href="mailto:users-bounces@clusterlabs.org" target="_blank">users-bounces@clusterlabs.org</a>> on behalf of <a href="mailto:users-request@clusterlabs.org" target="_blank">users-request@clusterlabs.org</a> <<a href="mailto:users-request@clusterlabs.org" target="_blank">users-request@clusterlabs.org</a>><br>
<b>Sent:</b> 10 October 2024 2:00 PM<br>
<b>To:</b> <a href="mailto:users@clusterlabs.org" target="_blank">users@clusterlabs.org</a> <<a href="mailto:users@clusterlabs.org" target="_blank">users@clusterlabs.org</a>><br>
<b>Subject:</b> Users Digest, Vol 117, Issue 5</font>
<div> </div>
</div>
<div><font size="2"><span style="font-size:11pt">
<div>Send Users mailing list submissions to<br>
<a href="mailto:users@clusterlabs.org" target="_blank">users@clusterlabs.org</a><br>
<br>
To subscribe or unsubscribe via the World Wide Web, visit<br>
<a href="https://lists.clusterlabs.org/mailman/listinfo/users" target="_blank">https://lists.clusterlabs.org/mailman/listinfo/users</a><br>
or, via email, send a message with subject or body 'help' to<br>
<a href="mailto:users-request@clusterlabs.org" target="_blank">users-request@clusterlabs.org</a><br>
<br>
You can reach the person managing the list at<br>
<a href="mailto:users-owner@clusterlabs.org" target="_blank">users-owner@clusterlabs.org</a><br>
<br>
When replying, please edit your Subject line so it is more specific<br>
than "Re: Contents of Users digest..."<br>
<br>
<br>
> Today's Topics:
>
>    1. Re: Fencing Approach (Klaus Wenninger)
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 9 Oct 2024 19:03:09 +0200
> From: Klaus Wenninger <kwenning@redhat.com>
> To: Cluster Labs - All topics related to open-source clustering
>     welcomed <users@clusterlabs.org>
> Cc: Angelo Ruggiero <angeloruggiero@yahoo.com>
> Subject: Re: [ClusterLabs] Fencing Approach
>
> On Wed, Oct 9, 2024 at 3:08 PM Angelo Ruggiero via Users <
> users@clusterlabs.org> wrote:
>
> > Hello,
> >
> > My setup....
> >
> > - We are setting up a pacemaker cluster to run SAP running on RHEL on
> > VMware virtual machines.
> > - We will have two nodes for the application server of SAP and 2 nodes
> > for the HANA database. SAP/RHEL provide good support on how to set up
> > the cluster.
> > - SAP will need a number of floating IPs to be moved around, as well as
> > NFS file systems coming from a NetApp device being mounted/unmounted.
> > SAP will need processes switching on and off when something happens,
> > planned or unplanned. I am not clear if the NetApp device is active and
> > the other site is DR, but what I know is the IP addresses just get
> > moved during a DR incident. Just to be complete, the HANA data sync is
> > done by HANA itself, most probably async with an RPO of 15 mins or so.
> > - We will have a quorum node also, with hopefully a separate network;
> > not sure if it will be on a separate VMware infra though.
> > - I am hoping to be allowed to use the VMware watchdog, although it
> > might take some persuading as it is declared "non standard" for us by
> > our infra people. I have it already in DEV to play with now.
> >
> > I managed to get the above working just using a floating IP and an NFS
> > mount as my resources, and I can see the following: the self-fencing
> > approach works fine, i.e. the servers reboot when they lose network
> > connectivity and/or become inquorate as long as they are offering
> > resources.
> >
> > So my questions are in relation to further fencing.... I did a lot of
> > reading and saw various references...
> >
> > 1. Use of sbd shared storage
> >
> > The question is what does using sbd with a shared storage really give
> > me. I need to justify why I need this shared storage again to the infra
> > guys, but to be honest also to myself. I have been given this infra and
> > will play with it the next few days.
> >
> > 2. Use of fence_vmware
> >
> > In addition there is the ability of course to fence using the
> > fence_vmware agents, and again I need to justify why I need this. In
> > this particular case it will be a very hard sell because the dev/test
> > and prod environments run on the same VMware infra, so to use
> > fence_vmware would effectively mean dev is connected to prod, i.e. the
> > user id for a dev or test box is being provided by a production
> > environment. I do not have this ability at all so cannot play with it.
> >
> > My current thought train... i.e. the typical things I think about...
> >
> > Perhaps someone can help me be clear on the benefits of 1 and 2 over
> > and above the setup I think is doable.
> >
> > 1. gives me the ability to use poison pill
> >
> > But what scenarios does poison pill really help with? Why would the
> > other parts of the cluster want to fence the node if the node itself
> > has not killed itself, because either the quorum device is gone or
> > network connectivity failed and resources need to be switched off?
> >
> > What I get is that it is very explicit, i.e. the other nodes tell the
> > other server to die. So it must be a case initiated by the other nodes.
> > I am struggling to think of a scenario where the other nodes would want
> > to fence it.
>
> Main scenario where poison pill shines is 2-node clusters where you don't
> have usable quorum for watchdog-fencing.
> Configured with pacemaker-awareness - default - availability of the
> shared disk doesn't become an issue as, due to fallback to availability
> of the 2nd node, the disk is no SPOF (single point of failure) in these
> clusters.
> Other nodes btw. can still kill a node with watchdog-fencing. If the node
> isn't able to accept that wish of another node for it to die, it will
> have lost quorum and have stopped triggering the watchdog anyway.
>
> Regards,
> Klaus
>
> > Possible scenarios, did I miss any?
> >
> > - Loss of network connection to the node. But that is covered by the
> > node self-fencing.
> > - If some monitoring said the node was not healthy or responding...
> > Maybe this is the case it is good for, but then it must be a partial
> > failure where the node is still part of the cluster and can respond,
> > i.e. not an OS freeze or a pure loss of connection, as then the
> > watchdog or the self-fencing will kick in.
> > - HW failures: cpu, memory, disk. For virtual hardware, does that
> > actually ever fail? Sorry if that is a stupid question; I could ask our
> > infra guys but.... So is virtual hardware so reliable that hw failures
> > can be ignored?
> > - Loss of shared storage. SAP uses a lot of shared storage via NFS. Not
> > sure what happens when that fails, need to research it a bit, but each
> > node will sort that out itself, I am presuming.
> > - Human error: but no cluster will fix that, and the human who makes a
> > change will realise it and revert.
> >
> > 2. Fence vmware
> >
> > I see this as a better poison pill as it works at the hardware level.
> > But if I do not need poison pill then I do not need this.
> >
> > In general, OS freezes or even panics, if they take too long, are
> > covered by the watchdog.
> >
> > regards
> > Angelo
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/