[ClusterLabs] Two nodes cluster issue

Tomer Azran tomer.azran at edp.co.il
Mon Aug 7 05:37:07 EDT 2017


Just updating that I added another level of fencing using watchdog-fencing.
Together with the quorum device, this combination works even in case of a power failure of both the server and its IPMI interface.
An important note: stonith-watchdog-timeout must be configured for this to work.
After reading the following great post: http://blog.clusterlabs.org/blog/2015/sbd-fun-and-profit , I chose the softdog watchdog, since I don't think the IPMI watchdog would do any good when the IPMI interface is down (if it is OK, it will be used as a fencing method anyway).
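
For reference, loading softdog and making it persistent is roughly the following (a minimal sketch assuming a RHEL/CentOS 7 style modules-load.d layout; adjust for your distribution):

modprobe softdog                                  # load the software watchdog module now
echo softdog > /etc/modules-load.d/softdog.conf   # have it loaded automatically on boot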

Just to document the solution (in case someone else needs it), the configuration I added is:
systemctl enable sbd 
pcs property set no-quorum-policy=suicide
pcs property set stonith-watchdog-timeout=15
pcs quorum device add model net host=qdevice algorithm=lms
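
To verify the result, something like the following should work (exact option names and output may differ between pcs/corosync/sbd versions, so treat this as a sketch):

corosync-quorumtool -s                        # quorum state; the quorum device should show up as an extra vote
pcs quorum status                             # pcs view of the same information
pcs property show stonith-watchdog-timeout    # confirm the watchdog timeout property is set
sbd query-watchdog                            # list the watchdog devices sbd can see (softdog included)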

I just can't decide whether the qdevice algorithm should be lms or ffsplit. I couldn't work out the difference between them, and I'm not sure which one is best for a two-node cluster with qdevice and watchdog fencing.

Can anyone advise on that?

-----Original Message-----
From: Jan Friesse [mailto:jfriesse at redhat.com] 
Sent: Tuesday, July 25, 2017 11:59 AM
To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>; kwenning at redhat.com; Prasad, Shashank <ssprasad at vanu.com>
Subject: Re: [ClusterLabs] Two nodes cluster issue

> Tomer Azran napsal(a):
>> I tend to agree with Klaus – I don't think that having a hook that 
>> bypasses stonith is the right way. It is better not to use stonith at all.
>> I think I will try to use an iSCSI target on my qdevice and set SBD 
>> to use it.
>> I still don't understand why qdevice can't take the place of SBD with 
>> shared storage; correct me if I'm wrong, but it looks like both of 
>> them are there for the same reason.
>
> Qdevice is there to be a third-side arbiter that decides which partition 
> is quorate. It can also be seen as a quorum-only node, so for a two-node 
> cluster it can be viewed as a third node (even though it is quite special 
> because it cannot run resources). It does not do fencing.
>
> SBD is a fencing device. It uses a disk as a third-side arbiter.

I've talked with Klaus and he told me that sbd in 7.3 does not use a disk as a third-side arbiter, so sorry for the confusion.

You should, however, still be able to use sbd to check whether pacemaker is alive and whether the partition has quorum - otherwise the watchdog kills the node. So qdevice will give you the "3rd" node and sbd fences the inquorate partition.
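
For illustration, a watchdog-only sbd configuration could look roughly like this (the /etc/sysconfig/sbd option names are the standard sbd ones, but the values are only examples - check your distribution's defaults):

# /etc/sysconfig/sbd -- diskless/watchdog-only mode: note there is no SBD_DEVICE line
SBD_WATCHDOG_DEV=/dev/watchdog   # device provided by softdog or a hardware watchdog
SBD_WATCHDOG_TIMEOUT=5           # example value; keep stonith-watchdog-timeout larger than this
SBD_PACEMAKER=yes                # let sbd track Pacemaker membership and quorum
SBD_STARTMODE=always             # example; "clean" is a more conservative choice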

Or (as mentioned previously) you can use fabric fencing.

Regards,
   Honza

>
>
>>
>> From: Klaus Wenninger [mailto:kwenning at redhat.com]
>> Sent: Monday, July 24, 2017 9:01 PM
>> To: Cluster Labs - All topics related to open-source clustering 
>> welcomed <users at clusterlabs.org>; Prasad, Shashank 
>> <ssprasad at vanu.com>
>> Subject: Re: [ClusterLabs] Two nodes cluster issue
>>
>> On 07/24/2017 07:32 PM, Prasad, Shashank wrote:
>> Sometimes IPMI fence devices share the power supply of the node, and that 
>> cannot be avoided.
>> In such scenarios the HA cluster is NOT able to handle the power 
>> failure of a node, since the power is shared with its own fence device.
>> IPMI-based fencing can also fail for other reasons.
>>
>> A failure to fence the failed node will cause the cluster to be marked 
>> UNCLEAN.
>> To get over it, the following command needs to be invoked on the 
>> surviving node.
>>
>> pcs stonith confirm <failed_node_name> --force
>>
>> This can be automated by hooking a recovery script onto the Stonith 
>> resource ‘Timed Out’ event.
>> To be more specific, Pacemaker Alerts can be used to watch for 
>> Stonith timeouts and failures.
>> In that script, all that essentially needs to be executed is the 
>> aforementioned command.
>>
>> If I get you right here, you could just as well disable fencing in the first place.
>> Actually quorum-based watchdog-fencing is the way to do this in a 
>> safe manner. This of course assumes you have a proper source of 
>> quorum in your 2-node setup, e.g. qdevice, or that you use a shared disk 
>> with sbd (not directly pacemaker quorum here, but a similar mechanism 
>> handled inside sbd).
>>
>>
>> Since the alerts are issued from the ‘hacluster’ login, sudo permissions 
>> for ‘hacluster’ need to be configured.
>>
>> Thanx.
>>
>>
>> From: Klaus Wenninger [mailto:kwenning at redhat.com]
>> Sent: Monday, July 24, 2017 9:24 PM
>> To: Kristián Feldsam; Cluster Labs - All topics related to 
>> open-source clustering welcomed
>> Subject: Re: [ClusterLabs] Two nodes cluster issue
>>
>> On 07/24/2017 05:37 PM, Kristián Feldsam wrote:
>> I personally think that powering off the node via a switched PDU is safer, 
>> or not?
>>
>> True if that works in your environment. If you can't do a 
>> physical setup where you aren't simultaneously losing the connection to 
>> both your node and the switch device (or you just want to cover cases 
>> where that happens), you have to come up with something else.
>>
>>
>>
>>
>> Best regards, Kristián Feldsam
>> Tel.: +420 773 303 353, +421 944 137 535
>> E-mail: support at feldhost.cz
>>
>> www.feldhost.cz - FeldHost™ – professional hosting and server 
>> services at fair prices.
>>
>> FELDSAM s.r.o.
>> V rohu 434/3
>> Praha 4 – Libuš, PSČ 142 00
>> IČ: 290 60 958, DIČ: CZ290 60 958
>> C 200350, registered with the Municipal Court in Prague
>>
>> Bank: Fio banka a.s.
>> Account number: 2400330446/2010
>> BIC: FIOBCZPPXX
>> IBAN: CZ82 2010 0000 0024 0033 0446
>>
>> On 24 Jul 2017, at 17:27, Klaus Wenninger 
>> <kwenning at redhat.com> wrote:
>>
>> On 07/24/2017 05:15 PM, Tomer Azran wrote:
>> I still don't understand why the qdevice concept doesn't help in this 
>> situation. Since the master node is down, I would expect the quorum 
>> to declare it as dead.
>> Why doesn't it happen?
>>
>> That is not how quorum works. It just limits the decision-making to 
>> the quorate subset of the cluster.
>> The nodes in an unknown state are still not guaranteed to be down.
>> That is why I suggested having quorum-based watchdog-fencing with sbd.
>> That would assure that within a certain time all nodes of the 
>> non-quorate part of the cluster are down.
>>
>>
>>
>>
>>
>>
>> On Mon, Jul 24, 2017 at 4:15 PM +0300, "Dmitri Maziuk"
>> <dmitri.maziuk at gmail.com> wrote:
>>
>> On 2017-07-24 07:51, Tomer Azran wrote:
>>> We don't have the ability to use it.
>>> Is that the only solution?
>>
>> No, but I'd recommend thinking about it first. Are you sure you will
>> care about your cluster working when your server room is on fire? 'Cause
>> unless you have halon suppression, your server room is a complete
>> write-off anyway. (Think water from sprinklers hitting rich chunky volts
>> in the servers.)
>>
>> Dima
>>
>> --
>> Klaus Wenninger
>> Senior Software Engineer, EMEA ENG Openstack Infrastructure
>> Red Hat
>> kwenning at redhat.com


_______________________________________________
Users mailing list: Users at clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

