[ClusterLabs] SBD as watchdog daemon

Klaus Wenninger kwenning at redhat.com
Thu Apr 11 13:00:40 EDT 2019


On 4/11/19 5:27 PM, Олег Самойлов wrote:
> Hi all.
> I am developing an HA PostgreSQL cluster spanning 2 or 3 datacenters. If a whole datacenter fails (blackout), fencing will not work and will prevent switching over to the working DC, so I disabled fencing. The cluster relies on quorum, and for the 2-DC case I added a quorum device in a third DC. But I need to somehow solve
Why would you disable fencing? SBD with watchdog-fencing (no shared
disk) is made for exactly that use-case, but you need fencing to
be enabled and stonith-watchdog-timeout to be set to roughly 2x the
watchdog-timeout.
As for triggering a node restart, disabling fencing shouldn't make much
difference. But with fencing disabled, the remaining cluster won't wait
for the missing node to be reset before proceeding (regardless of
whether the lost node shows up again or not).
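
A minimal sketch of the settings described above, assuming pcs is used to
set cluster properties and sbd reads SBD_WATCHDOG_TIMEOUT from
/etc/sysconfig/sbd (as with the CentOS packages). The value 5 and the
helper name stonith_watchdog_timeout are only illustrative, and "roughly
2x" is taken as exactly 2x here:

```shell
#!/bin/sh
# Sketch only: derive stonith-watchdog-timeout from the sbd watchdog timeout.
# SBD_WATCHDOG_TIMEOUT is the variable sbd reads from /etc/sysconfig/sbd;
# the default of 5 seconds below is an example, not a recommendation.
SBD_WATCHDOG_TIMEOUT=${SBD_WATCHDOG_TIMEOUT:-5}

# Hypothetical helper: "roughly 2x" is assumed to mean exactly 2x.
stonith_watchdog_timeout() {
    echo $(( $1 * 2 ))
}

timeout=$(stonith_watchdog_timeout "$SBD_WATCHDOG_TIMEOUT")

# Print the commands instead of running them, so they can be reviewed first.
echo "pcs property set stonith-enabled=true"
echo "pcs property set stonith-watchdog-timeout=${timeout}s"
```

The script only prints the two pcs commands; nothing is changed on the
cluster until they are run by hand.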
> the cases where corosync or pacemaker freezes. For this I use a hardware watchdog or softdog with SBD as the watchdog daemon (without shared devices). With that in place, if I kill corosync or pacemakerd, all is fine: the node is restarted. If I freeze sbd with `killall -s STOP sbd`, all is fine too: it reboots. But if I freeze corosync or pacemakerd with `killall -s STOP`, or run `ifdown eth0` (corosync is frozen in this case), nothing happens. The question is: is this fixed in the master branch or in 1.4.0? (I use the CentOS rpms: sbd v1.3.1.) If not, where (which file or function) should I look to fix it?
Given the above, I'm not sure how you configured sbd.

I'm not aware of any fixes directly targeting issues like that since v1.3.1.
There are 2 post v1.4.0 fixes that might be helpful in some cases though.
(make handling of cib-connection loss more robust & finalize cmap
connection if disconnected from cluster)

ifdown of the corosync-interface definitely gives me a reboot on a
cluster with corosync-3 and current sbd from master.
But iirc there was an improvement regarding this in corosync.
Freezing corosync or pacemakerd, on the other hand, doesn't trigger anything.
There is an outstanding PR that has sbd do a regular ping to corosync
via cpg, which should help here. Unfortunately it needs to be rebased
onto current sbd (I can't find it atm, which is strange).
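
For repeating the freeze test, a small sketch that pauses a process with
SIGSTOP and resumes it after a while, so the daemon is not left stopped if
the node does not get reset. The helper name freeze_for and the 60 second
duration are assumptions for illustration:

```shell
#!/bin/sh
# Hypothetical helper: pause a process with SIGSTOP, keep it paused for a
# number of seconds, then resume it with SIGCONT.
freeze_for() {
    pid=$1
    seconds=$2
    kill -STOP "$pid" || return 1
    sleep "$seconds"
    # Only reached if the node was not reset in the meantime.
    kill -CONT "$pid"
}

# Example (commented out on purpose; run as root on a test node only):
# freeze_for "$(pidof corosync)" 60
```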

pacemakerd is a little more complicated, as it is just the main
control daemon. Freezing it shouldn't be harmful at first, but since
pacemakerd observes the rest of the pacemaker daemons, it should itself
be observed by a watchdog somehow. iirc there were some tests by hideo
using corosync-watchdog-device-integration, but these attempts
unfortunately fell dormant as well. You should find some discussion
about it in the mailing list archives. Unfortunately, having corosync
open a watchdog-device makes it fight with sbd for that resource. A
generic solution isn't that simple, though, as not every setup is using
sbd.

Klaus