[ClusterLabs] SBD as watchdog daemon

Fri Apr 12 08:30:14 EDT 2019

> On 11 Apr 2019, at 20:00, Klaus Wenninger <kwenning at redhat.com> wrote:
> 
> On 4/11/19 5:27 PM, Олег Самойлов wrote:
>> Hi all.
>> I am developing an HA PostgreSQL cluster for 2 or 3 datacenters. In case of a datacenter failure (blackout), fencing will not work and will prevent switching to the working DC, so I disabled fencing. The cluster relies on quorum, and for the 2-DC case I added a quorum device in a third DC. But I still need to somehow handle 
> Why would you disable fencing? SBD with watchdog-fencing (no shared
> disk) is made for exactly that use-case but you need fencing to
> be enabled and stonith-watchdog-timeout to be set to roughly 2x the
> watchdog-timeout.

Interesting. The documentation has a lot about using sbd with 1, 2, or 3 block devices, but nothing about using it without block devices, except a sentence saying that this is possible. :)

> Regarding a node restart to be triggered that shouldn't make much
> difference but if you disable fencing you won't get the remaining
> cluster to wait for the missing node to be reset and proceed afterwards
> (regardless if the lost node shows up again or not).

Yep, in my case this will be good for floating IPs.

>> cases where corosync or pacemaker freezes. For this I use a hardware watchdog or softdog, with SBD as the watchdog daemon (without shared devices). With this setup, if I kill corosync or pacemakerd, all is fine: the node is restarted. If I freeze sbd with `killall -s STOP sbd`, all is fine too: the node reboots. But if I freeze corosync or pacemakerd with `killall -s STOP`, or run `ifdown eth0` (corosync is frozen in this case), nothing happens. My question: is this fixed in the master branch or in 1.4.0? (I use the CentOS RPMs: sbd v1.3.1.) If not, where (in which file or function) should I look to fix it?
> Referring to the above I'm not sure how you did configure sbd.

Just:
pcs stonith sbd enable
pcs property set stonith-enabled=false
Now I have changed it to:
pcs stonith sbd enable
pcs property set stonith-enabled=true
pcs property set stonith-watchdog-timeout=12
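
For reference, the "roughly 2x the watchdog-timeout" rule of thumb quoted above can be sketched as a small shell helper. This is only an illustration: the function name `stonith_watchdog_timeout` is hypothetical, and it assumes the watchdog timeout in seconds is available in a variable named `SBD_WATCHDOG_TIMEOUT` (falling back to 5 if unset). The script only prints the `pcs` command and does not run it.

```shell
#!/bin/sh
# Hypothetical helper: derive a stonith-watchdog-timeout value as
# 2x the watchdog timeout (both in seconds), per the rule of thumb above.
stonith_watchdog_timeout() {
    wd_timeout="$1"
    echo $((wd_timeout * 2))
}

# Assumption: SBD_WATCHDOG_TIMEOUT holds the watchdog timeout in seconds;
# use 5 as a fallback when it is not set.
wd="${SBD_WATCHDOG_TIMEOUT:-5}"

# Print the command for review instead of executing it.
echo "pcs property set stonith-watchdog-timeout=$(stonith_watchdog_timeout "$wd")"
```

With a 6-second watchdog timeout this would give the value 12 used above; a somewhat larger value than exactly 2x should also fit the "roughly" in the advice.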

> 
> ifdown of the corosync-interface definitely gives me a reboot on a
> cluster with corosync-3 and current sbd from master.

I tested with corosync 2.4.3 (the default for CentOS 7). Or maybe in your case the reboot was caused by fencing. But no matter, as long as the watchdog works as expected.

> But iirc there was an improvement regarding this in corosync.
> Freezing corosync or pacemakerd on the other hand doesn't trigger anything.
> For doing a regular ping to corosync via cpg there is an outstanding PR
> that should help here - unfortunately needs to be rebased to current sbd
> (don't find it atm - strange)
> 
> Regarding pacemakerd that should be a little bit more complicated as
> pacemakerd is just the main control daemon.
> So if you freeze it, that shouldn't be harmful at first, but of
> course, as pacemakerd observes the rest of the pacemaker daemons,
> it should itself be watchdog-observed somehow. iirc there
> were some tests by hideo using corosync-watchdog-device-integration. But
> these attempts unfortunately stalled as well. You should find some
> discussion in the mailinglist-archives about it. Unfortunately having
> corosync open a watchdog-device makes it fight with sbd for that
> resource. But a generic solution isn't that simple as not every setup is
> using sbd.

Well, I see: freezing of the pacemaker daemons is not monitored by the watchdog daemon (sbd). That's strange, since I see two "Watcher" processes from sbd, one for corosync and one for pacemakerd; they must do something useful. :) I want the behaviour to be at least the same as with normal fencing: with fencing, if corosync or pacemaker freezes, the failed node is fenced.
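
To illustrate why a frozen daemon differs from a killed one, here is a minimal sketch of what `killall -s STOP` does to a process, using `sleep` as a stand-in for corosync or pacemakerd. It assumes Linux with `/proc` mounted; the helper name `proc_state` is hypothetical. It says nothing about how sbd actually checks its peers; it only shows that a stopped process still exists, so a check based purely on the process being present would not notice the freeze.

```shell
#!/bin/sh
# Hypothetical helper: print the kernel state letter of a PID
# (Linux /proc assumed; "T" means stopped by a signal).
proc_state() {
    awk '/^State:/ {print $2}' "/proc/$1/status"
}

sleep 60 &                 # stand-in for a daemon such as corosync
pid=$!

kill -STOP "$pid"          # same signal that `killall -s STOP` sends
sleep 1                    # give the kernel a moment to stop the process
frozen_state=$(proc_state "$pid")

# kill -0 sends no signal; it only tests whether the process exists.
if kill -0 "$pid" 2>/dev/null; then
    still_exists=yes
else
    still_exists=no
fi

echo "state=$frozen_state exists=$still_exists"

# Clean up the stand-in process.
kill -CONT "$pid"
kill "$pid"
wait "$pid" 2>/dev/null || true
```

Detecting this kind of freeze needs an active liveness probe that the daemon itself must answer, which is the idea behind the regular ping to corosync via cpg mentioned above.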