[ClusterLabs] SBD as watchdog daemon

Andrei Borzenkov arvidjaar at gmail.com
Sun Apr 14 03:12:54 EDT 2019


On 12.04.2019 15:30, Олег Самойлов wrote:
> 
>> On 11 Apr 2019, at 20:00, Klaus Wenninger <kwenning at redhat.com>
>> wrote:
>> 
>> On 4/11/19 5:27 PM, Олег Самойлов wrote:
>>> Hi all. I am developing an HA PostgreSQL cluster for 2 or 3
>>> datacenters. In case of a datacenter failure (blackout) fencing
>>> will not work and will prevent switching to the working DC, so I
>>> disabled fencing. The cluster is based on a quorum, and I added a
>>> quorum device in a third DC for the 2-DC case. But I need to
>>> somehow solve
>> Why would you disable fencing? SBD with watchdog-fencing (no
>> shared disk) is made for exactly that use-case but you need fencing
>> to be enabled and stonith-watchdog-timeout to be set to roughly 2x
>> the watchdog-timeout.
> 
> Interesting. There is a lot in the documentation about using sbd
> with 1, 2 or 3 block devices, but about using it without block
> devices there is nothing, except a sentence that this is possible. :)
> 

Yes, the name stonith-watchdog-timeout does not really ring a bell and does
not make it obvious that it is related to SBD in any way.

The way it seems to work is:

If stonithd^Wpacemaker-fenced receives a request to kill a node, no
suitable stonith device for this node was found, SBD is active and
stonith-watchdog-timeout is non-zero, fenced will a) forward the request to
the victim and b) wait for the specified timeout, expecting the node to
self-fence. If the victim is alive, it will initiate a reboot either via the
local SBD or via SysRq if it could not contact SBD. If the victim is not
reachable, it is expected that SBD will commit suicide.

You will see something like this in the logs:

Apr 14 09:13:45 ha1 pacemaker-fenced    [1808] (call_remote_stonith)
notice: Waiting 10s for ha2 to self-fence (reboot) for
pacemaker-controld.1812.5a81fe48 ((nil))
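
For reference, here is a minimal sketch of the diskless (watchdog-only) setup
this relies on; the device path and timeout values below are just examples,
and the sysconfig path may differ between distributions:

# /etc/sysconfig/sbd - no SBD_DEVICE set, so sbd runs in watchdog-only mode
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5

# Cluster properties: fencing stays enabled and stonith-watchdog-timeout is
# roughly 2x the watchdog timeout (or -1 to let pacemaker calculate it from
# the SBD timeout, as in the logs quoted below).
crm_attribute --type crm_config --name stonith-enabled --update true
crm_attribute --type crm_config --name stonith-watchdog-timeout --update 10s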

Unfortunately

1. It is far from obvious. You only see something during an actual
stonith attempt. You see

Apr 14 09:13:45 ha1 pacemaker-schedulerd[1811] (unpack_config)  notice:
Watchdog will be used via SBD if fencing is required

sprinkled over the log file, but it is misleading - it will *NOT* use SBD
fencing unless stonith-watchdog-timeout is actually set to a non-zero value.
And you see the following log entries exactly once during normal startup:

Apr 14 09:10:15 ha1 pacemaker-controld  [1812] (check_sbd_timeout)
debug: Using calculated value 10000 for stonith-watchdog-timeout (-1)
Apr 14 09:10:15 ha1 pacemaker-controld  [1812] (check_sbd_timeout)
info: Watchdog configured with stonith-watchdog-timeout -1 and SBD
timeout 5000ms

And they may be too far in the past and already rotated away.

2. No high-level command showing the current pacemaker run-time state tells
you whether this mechanism is active. OK, if you know what you are
looking for, you may use a direct CIB query to check the values of
have-watchdog and stonith-watchdog-timeout. But where, pray, is
have-watchdog documented?
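
For example, assuming crm_attribute is available, something like this shows
both values (a cibadmin XPath query against the crm_config section would work
just as well):

crm_attribute --type crm_config --name have-watchdog --query
crm_attribute --type crm_config --name stonith-watchdog-timeout --query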

Apparently the intention was to create a special internal stonith device
named "watchdog" which would be visible at least in stonith_admin -L
output (not sure whether crm_mon would show it). But as currently
implemented, fenced would (attempt to) create this device once, very early
during initial startup, *if* stonith-watchdog-timeout is not zero - but
it queries for the value of stonith-watchdog-timeout (or at least
receives the reply) far later, which means this special device is never
created. This leads to a funny duplication of code which contains
identical handling of both the no-device and the special "watchdog" device
cases. If anything, this is confusing to anyone looking at the code.

Apr 14 09:09:54 ha1 pacemaker-fenced    [1808] (main)   info: Starting
pacemaker-fenced mainloop
^^^^^^^^^^^^^^^ this line is already past the attempt to create the special
watchdog device

Apr 14 09:10:15 ha1 pacemaker-controld  [1812] (check_sbd_timeout)
info: Watchdog configured with stonith-watchdog-timeout -1 and SBD
timeout 5000ms

And of course stonith-watchdog-timeout can be changed to 0 at run-time,
so this device would then have to be removed even if it had been created
successfully in the first place. So it really needs to be created in the CIB
update handler. That leaves a window between changing this value and the
creation of the special device, so the duplication of code is probably
inevitable.
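
Anyone curious can check on a running node whether the device ever gets
registered; given the above I would expect "watchdog" to be missing from the
list:

stonith_admin --list-registered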

If anyone thinks it is a bug, I will open a bug report.

3. If the node could not be contacted for whatever reason and has not
received the self-fence request, it will still be assumed to be fenced and
pacemaker will start relocating resources. I do not know whether
pacemaker cross-checks this with the actual node state (i.e. the node is
expected to be lost by the time the watchdog timeout expires).



