[ClusterLabs] pacemaker with sbd fails to start if node reboots too fast.

Tue Dec 5 14:33:32 EST 2017

05.12.2017 13:34, Gao,Yan wrote:
> On 12/05/2017 08:57 AM, Dejan Muhamedagic wrote:
>> On Mon, Dec 04, 2017 at 09:55:46PM +0300, Andrei Borzenkov wrote:
>>> 04.12.2017 14:48, Gao,Yan wrote:
>>>> On 12/02/2017 07:19 PM, Andrei Borzenkov wrote:
>>>>> 30.11.2017 13:48, Gao,Yan wrote:
>>>>>> On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:
>>>>>>> SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two-node cluster of
>>>>>>> VMs on vSphere using a shared VMDK as SBD. During basic tests (killing
>>>>>>> corosync and forcing STONITH), pacemaker was not started after reboot.
>>>>>>> In the logs I see during boot:
>>>>>>>
>>>>>>> Nov 22 16:04:56 sapprod01s crmd[3151]:     crit: We were allegedly
>>>>>>> just fenced by sapprod01p for sapprod01p
>>>>>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]:  warning: The crmd
>>>>>>> process (3151) can no longer be respawned,
>>>>>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]:   notice: Shutting down
>>>>>>> Pacemaker
>>>>>>>
>>>>>>> SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that
>>>>>>> stonith with SBD always takes msgwait (at least, visually the host is
>>>>>>> not declared OFFLINE until 120s have passed). But the VM reboots
>>>>>>> lightning fast and is up and running long before the timeout expires.
>>>>>>>
>>>>>>> I think I have seen a similar report already. Is it something that
>>>>>>> can be fixed by SBD/pacemaker tuning?
>>>>>> SBD_DELAY_START=yes in /etc/sysconfig/sbd is the solution.
>>>>>>
>>>>>
>>>>> I tried it (on openSUSE Tumbleweed, which is what I have at hand; it has
>>>>> SBD 1.3.0) and with SBD_DELAY_START=yes sbd does not appear to watch
>>>>> the disk at all.
>>>> It simply waits that long on startup before starting the rest of the
>>>> cluster stack to make sure the fencing that targeted it has returned. It
>>>> intentionally doesn't watch anything during this period of time.
>>>>
>>>
>>> Unfortunately it waits too long.
>>>
>>> ha1:~ # systemctl status sbd.service
>>> ● sbd.service - Shared-storage based fencing daemon
>>>     Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor
>>> preset: disabled)
>>>     Active: failed (Result: timeout) since Mon 2017-12-04 21:47:03 MSK;
>>> 4min 16s ago
>>>    Process: 1861 ExecStop=/usr/bin/kill -TERM $MAINPID (code=exited,
>>> status=0/SUCCESS)
>>>    Process: 2058 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid
>>> watch (code=killed, signa
>>>   Main PID: 1792 (code=exited, status=0/SUCCESS)
>>>
>>> Dec 04 21:45:32 ha1 systemd[1]: Starting Shared-storage based fencing
>>> daemon...
>>> Dec 04 21:47:02 ha1 systemd[1]: sbd.service: Start operation timed out.
>>> Terminating.
>>> Dec 04 21:47:03 ha1 systemd[1]: Failed to start Shared-storage based
>>> fencing daemon.
>>> Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Unit entered failed state.
>>> Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Failed with result
>>> 'timeout'.
>>>
>>> But the real problem is that, even though SBD failed to start, the whole
>>> cluster stack continues to run; and because SBD blindly trusts that nodes
>>> are well behaved, fencing appears to succeed after the timeout ... without
>>> anyone taking any action on the poison pill ...
>>
>> That's something I always wondered about: if a node is capable of
>> reading a poison pill then it could before shutdown also write an
>> "I'm leaving" message into its slot. Wouldn't that make sbd more
>> reliable? Any reason not to implement that?
> Probably it's not considered necessary :) SBD is a fencing mechanism
> which only needs to ensure fencing works.

I'm sorry, but SBD has zero chance of ensuring that fencing works. Recently I
did a storage vMotion of a VM with a shared VMDK for SBD - it silently created
a copy of the VMDK which was indistinguishable from the original one. As a
result, both VMs ran with their own copy. Of course fencing did not work - but
each VM *assumed* it worked because it posted a message and waited for the
timeout ...

I would expect the "monitor" action of the SBD fencing agent to actually test
whether messages are seen by the remote node(s) ...

> SBD on the fencing target is
> either there eating the pill or getting reset by watchdog, otherwise
> it's not there which is supposed to imply the whole cluster stack is not
> running so that it doesn't need to actually eat the pill.
> 
> How systemd should handle the service dependencies is another topic...
>
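
As a rough check of the "waits too long" observation quoted above: the journal
shows systemd giving up on sbd.service some time after the start began, and
that gap can be read straight off the two timestamps. The sketch below only
does that arithmetic. The 120s delay is an assumption carried over from the
msgwait on the SLES cluster (the thread does not say what ha1 uses), and the
helper names are made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the quoted systemd journal for sbd.service.
START_LOGGED = "21:45:32"    # "Starting Shared-storage based fencing daemon..."
TIMEOUT_LOGGED = "21:47:02"  # "Start operation timed out. Terminating."


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two same-day HH:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


def start_outlives_timeout(delay_s: float, start_timeout_s: float) -> bool:
    """True if a start-up delay of delay_s is still running when the
    service manager's start timeout of start_timeout_s fires."""
    return delay_s >= start_timeout_s


# How long systemd waited before terminating the start, per the journal.
observed_timeout_s = seconds_between(START_LOGGED, TIMEOUT_LOGGED)

# ASSUMPTION: the start-up delay equals msgwait, and msgwait is 120s as on
# the SLES cluster; the value actually configured on ha1 is not stated.
ASSUMED_MSGWAIT_S = 120

print(observed_timeout_s)
print(start_outlives_timeout(ASSUMED_MSGWAIT_S, observed_timeout_s))
```

The observed gap is 90 seconds, which matches the start timeout systemd applies
by default. If the assumption about the delay holds, any start-up delay of that
length or more would be cut off in the same way unless the unit's start timeout
is raised above the delay.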