[ClusterLabs] SBD as watchdog daemon

Олег Самойлов splarv at ya.ru
Fri Apr 12 08:30:14 EDT 2019


> On 11 Apr 2019, at 20:00, Klaus Wenninger <kwenning at redhat.com> wrote:
> 
> On 4/11/19 5:27 PM, Олег Самойлов wrote:
>> Hi all.
>> I am developing an HA PostgreSQL cluster for 2 or 3 datacenters. In case of a datacenter failure (blackout), fencing will not work and will prevent switching to the working DC. So I disabled fencing. The cluster relies on quorum, and for the 2-DC case I added a quorum device in a third DC. But I need to somehow solve
> Why would you disable fencing? SBD with watchdog-fencing (no shared
> disk) is made for exactly that use-case but you need fencing to
> be enabled and stonith-watchdog-timeout to be set to roughly 2x the
> watchdog-timeout.

Interesting. There is a lot in the documentation about using sbd with 1, 2 or 3 block devices, but nothing about using it without block devices, except a sentence saying that this is possible. :)

> Regarding a node restart to be triggered that shouldn't make much
> difference but if you disable fencing you won't get the remaining
> cluster to wait for the missing node to be reset and proceed afterwards
> (regardless if the lost node shows up again or not).

Yep, in my case this will be good for floating IPs.

>> cases when corosync or pacemaker freezes. For this I use a hw watchdog or softdog, with SBD as the watchdog daemon (without shared devices). With this setup, if I kill corosync or pacemakerd, all is fine: the node is restarted. And if I freeze sbd with `killall -s STOP sbd`, all is fine too: the node reboots. But if I freeze corosync or pacemakerd with `killall -s STOP`, or with `ifdown eth0` (corosync is frozen in this case), nothing happens. The question is: «Is this fixed in the master branch or in 1.4.0?» (I use the CentOS rpms: sbd v1.3.1). Or where do I need to look (in what file or function) to fix this?
> Referring to the above I'm not sure how you did configure sbd.

Just 
pcs stonith sbd enable
pcs property set stonith-enabled=false
Now I have changed it to
pcs stonith sbd enable
pcs property set stonith-enabled=true
pcs property set stonith-watchdog-timeout=12
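
The "roughly 2x the watchdog-timeout" rule quoted above can be written down as a small shell sketch. This is only an illustration under assumptions: the helper names are made up, the config path /etc/sysconfig/sbd is the usual location on CentOS, the fallback of 5 seconds is assumed when SBD_WATCHDOG_TIMEOUT is not set, and "roughly 2x" is taken as exactly 2x.

```shell
#!/bin/sh
# Sketch only: derive stonith-watchdog-timeout as 2x the sbd watchdog timeout.
# Assumptions: helper names are hypothetical; /etc/sysconfig/sbd is the
# usual config path on CentOS; 5 s is used when the variable is not set.

# Print SBD_WATCHDOG_TIMEOUT from the given sbd config file (or 5 if unset).
sbd_watchdog_timeout() {
    conf="${1:-/etc/sysconfig/sbd}"
    value=""
    if [ -r "$conf" ]; then
        # Accept both SBD_WATCHDOG_TIMEOUT=6 and SBD_WATCHDOG_TIMEOUT="6";
        # if the variable appears more than once, the last one wins.
        value=$(sed -n 's/^SBD_WATCHDOG_TIMEOUT="\{0,1\}\([0-9][0-9]*\).*/\1/p' "$conf" | tail -n 1)
    fi
    echo "${value:-5}"
}

# Print twice the given watchdog timeout (seconds).
stonith_watchdog_timeout() {
    echo $(( $1 * 2 ))
}

# Usage on a cluster node (commented out so the sketch has no side effects):
# wd=$(sbd_watchdog_timeout)
# pcs property set stonith-watchdog-timeout="$(stonith_watchdog_timeout "$wd")"
```

With a watchdog timeout of 6 seconds this gives 12, the value set above.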

> 
> ifdown of the corosync-interface definitely gives me a reboot on a
> cluster with corosync-3 and current sbd from master.

I tested with corosync 2.4.3 (the default for CentOS 7). Or maybe in your case the reboot was caused by fencing. But that doesn't matter, as long as the watchdog works as expected.

> But iirc there was an improvement regarding this in corosync.
> Freezing corosync or pacemakerd on the other hand doesn't trigger anything.
> For doing a regular ping to corosync via cpg there is an outstanding PR
> that should help here - unfortunately needs to be rebased to current sbd
> (don't find it atm - strange)
> 
> Regarding pacemakerd that should be a little bit more complicated as
> pacemakerd is just the main control daemon.
> So if you freeze that it shouldn't be harmful at first but of
> course as pacemakerd is doing the observation of the rest of the
> pacemaker-daemons it should be somehow watchdog-observed. iirc there
> were some tests by hideo using corosync-watchdog-device-integration. But
> these attempts unfortunately stalled as well. You should find some
> discussion in the mailinglist-archives about it. Unfortunately having
> corosync open a watchdog-device makes it fight with sbd for that
> resource. But a generic solution isn't that simple as not every setup is
> using sbd.

Well, I see: freezing of the pacemaker daemons is not monitored by the watchdog daemon (sbd). It's strange, because I see two «Watcher» processes from sbd, one for corosync and the other for pacemakerd, so they must do something useful. :) I would like the behaviour to be at least the same as with normal fencing. With fencing, if corosync or pacemaker freezes, the failed node is fenced.

