[ClusterLabs] Two nodes cluster issue
Tomer Azran
tomer.azran at edp.co.il
Sun Jul 30 23:37:02 EDT 2017
Just an update: I added another level of fencing using watchdog fencing.
Combined with the quorum device, this setup works even when both the server and its IPMI interface lose power.
An important note: the stonith-watchdog-timeout property must be configured for this to work.
After reading the following great post: http://blog.clusterlabs.org/blog/2015/sbd-fun-and-profit , I chose the softdog watchdog, since I don't think the IPMI watchdog would do any good when the IPMI interface is down (if IPMI is OK, it is already used as a fencing method).
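For reference, loading softdog at boot and pointing sbd at it might look roughly like this on CentOS 7 (a sketch only; the watchdog device path and the 5-second timeout are assumptions, adjust them to your environment):
echo softdog > /etc/modules-load.d/softdog.conf
modprobe softdog
# in /etc/sysconfig/sbd:
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5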
Just for documenting the solution (in case someone else needs it), the configuration I added is:
systemctl enable sbd
pcs property set no-quorum-policy=suicide
pcs property set stonith-watchdog-timeout=15
pcs quorum device add model net host=qdevice algorithm=lms
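To verify the setup afterwards, something like the following should show the registered watchdog and the qdevice state (a sketch; exact output and command availability may vary with the sbd and pcs versions):
sbd query-watchdog
pcs quorum status
pcs quorum device status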
I just can't decide whether the qdevice algorithm should be lms or ffsplit. I couldn't work out the difference between them, and I'm not sure which one is best for a two-node cluster with qdevice and watchdog fencing.
Can anyone advise on that?
From: Klaus Wenninger [mailto:kwenning at redhat.com]
Sent: Tuesday, July 25, 2017 2:19 AM
To: Tomer Azran <tomer.azran at edp.co.il>; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>; Prasad, Shashank <ssprasad at vanu.com>
Subject: Re: [ClusterLabs] Two nodes cluster issue
On 07/24/2017 11:59 PM, Tomer Azran wrote:
There is a problem with that – it seems like SBD with shared disk is disabled on CentOS 7.3:
When I run:
# sbd -d /dev/sbd create
I get:
Shared disk functionality not supported
Which is why I suggested going for watchdog-fencing using
your qdevice setup.
As said, I haven't tried it with qdevice-quorum - but I don't
see a reason why that shouldn't work.
no-quorum-policy has to be suicide, of course.
So I might try a software watchdog (softdog or ipmi_watchdog)
A reliable watchdog is really crucial for sbd, so I would
recommend going for IPMI or anything else that has
hardware behind it.
Klaus
Tomer.
From: Tomer Azran [mailto:tomer.azran at edp.co.il]
Sent: Tuesday, July 25, 2017 12:30 AM
To: kwenning at redhat.com; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>; Prasad, Shashank <ssprasad at vanu.com>
Subject: Re: [ClusterLabs] Two nodes cluster issue
I tend to agree with Klaus – I don't think that having a hook that bypasses stonith is the right way. It is better not to use stonith at all.
That was of course said with a certain degree of hyperbole. Anything is of course better than not having
fencing at all.
I might be wrong, but what you were saying drew a picture in my mind of your
2 nodes sitting in 2 quite separated sites/rooms, and in that case ...
I think I will try to use an iSCSI target on my qdevice and set SBD to use it.
I still don't understand why qdevice can't take the place of SBD with shared storage; correct me if I'm wrong, but it looks like both of them are there for the same reason.
sbd with watchdog + qdevice can take the place of sbd with shared storage.
qdevice is there to decide which part of the cluster is quorate and which is not - in cases
where after a split this wouldn't otherwise be possible.
sbd (with watchdog) is then there to reliably take down the non-quorate part
within a well-defined time.
From: Klaus Wenninger [mailto:kwenning at redhat.com]
Sent: Monday, July 24, 2017 9:01 PM
To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>; Prasad, Shashank <ssprasad at vanu.com>
Subject: Re: [ClusterLabs] Two nodes cluster issue
On 07/24/2017 07:32 PM, Prasad, Shashank wrote:
Sometimes IPMI fence devices share the power supply of their node, and this cannot be avoided.
In such scenarios the HA cluster is NOT able to handle the power failure of a node, since the power is shared with its own fence device.
IPMI-based fencing can also fail for other reasons.
A failure to fence the failed node will cause the cluster to be marked UNCLEAN.
To get over it, the following command needs to be invoked on the surviving node.
pcs stonith confirm <failed_node_name> --force
This can be automated by hooking in a recovery script on the stonith resource's 'Timed Out' event.
To be more specific, Pacemaker alerts can be used to watch for stonith timeouts and failures.
In that script, essentially all that needs to be executed is the aforementioned command.
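A rough sketch of such an alert agent might look like this (an illustration only, not a tested script; it assumes Pacemaker 1.1.15+ alert agents registered with 'pcs alert create path=...', and the sudo setup mentioned below):
#!/bin/sh
# Illustrative only: react to failed fencing events and manually
# confirm the node as down. This effectively bypasses real fencing -
# see the reply below before considering anything like this.
if [ "${CRM_alert_kind}" = "fencing" ] && [ "${CRM_alert_rc}" != "0" ]; then
    sudo pcs stonith confirm "${CRM_alert_node}" --force
fi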
If I get you right here, you could just as well disable fencing in the first place.
Actually quorum-based-watchdog-fencing is the way to do this in a
safe manner. This of course assumes you have a proper source for
quorum in your 2-node-setup with e.g. qdevice or using a shared
disk with sbd (not directly pacemaker quorum here but similar thing
handled inside sbd).
Since the alerts are issued under the 'hacluster' login, sudo permissions for 'hacluster' need to be configured.
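For example, an entry along these lines (a sketch; it assumes pcs lives in /usr/sbin and that the file is added via visudo under /etc/sudoers.d/):
hacluster ALL=(root) NOPASSWD: /usr/sbin/pcs stonith confirm *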
Thanx.
From: Klaus Wenninger [mailto:kwenning at redhat.com]
Sent: Monday, July 24, 2017 9:24 PM
To: Kristián Feldsam; Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Two nodes cluster issue
On 07/24/2017 05:37 PM, Kristián Feldsam wrote:
I personally think that powering off the node via a switched PDU is safer, or not?
True, if that works in your environment. If you can't do a physical setup
where you aren't simultaneously losing connection to both your node and
the switch device (or you just want to cover cases where that happens),
you have to come up with something else.
Best regards, Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail: support at feldhost.cz
www.feldhost.cz - FeldHost™ – professional hosting and server services at reasonable prices.
FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
Company ID (IČ): 290 60 958, VAT ID (DIČ): CZ290 60 958
File C 200350, registered with the Municipal Court in Prague
Bank: Fio banka a.s.
Account number: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010 0000 0024 0033 0446
On 24 Jul 2017, at 17:27, Klaus Wenninger <kwenning at redhat.com> wrote:
On 07/24/2017 05:15 PM, Tomer Azran wrote:
I still don't understand why the qdevice concept doesn't help in this situation. Since the master node is down, I would expect the quorum to declare it dead.
Why doesn't that happen?
That is not how quorum works. It just limits the decision-making to the quorate subset of the cluster.
Still, the nodes in an unknown state are not guaranteed to be down.
That is why I suggested quorum-based watchdog-fencing with sbd.
That would ensure that within a certain time all nodes of the non-quorate part
of the cluster are down.
On Mon, Jul 24, 2017 at 4:15 PM +0300, "Dmitri Maziuk" <dmitri.maziuk at gmail.com> wrote:
On 2017-07-24 07:51, Tomer Azran wrote:
> We don't have the ability to use it.
> Is that the only solution?
No, but I'd recommend thinking about it first. Are you sure you will
care about your cluster working when your server room is on fire? 'Cause
unless you have halon suppression, your server room is a complete
write-off anyway. (Think water from sprinklers hitting rich chunky volts
in the servers.)
Dima
--
Klaus Wenninger
Senior Software Engineer, EMEA ENG Openstack Infrastructure
Red Hat
kwenning at redhat.com
_______________________________________________
Users mailing list: Users at clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org