[ClusterLabs] Wrong sbd.service dependencies

Gao,Yan ygao at suse.com
Sun Dec 17 12:20:37 UTC 2017


On 2017/12/16 16:59, Andrei Borzenkov wrote:
> 04.12.2017 21:55, Andrei Borzenkov wrote:
> ...
>>>>
>>>> I tried it (on openSUSE Tumbleweed which is what I have at hand, it has
>>>> SBD 1.3.0) and with SBD_DELAY_START=yes sbd does not appear to watch
>>>> disk at all.
>>> It simply waits that long on startup before starting the rest of the
>>> cluster stack to make sure the fencing that targeted it has returned. It
>>> intentionally doesn't watch anything during this period of time.
>>>
>>
>> Unfortunately it waits too long.
>>
>> ha1:~ # systemctl status sbd.service
>> ● sbd.service - Shared-storage based fencing daemon
>>     Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor
>> preset: disabled)
>>     Active: failed (Result: timeout) since Mon 2017-12-04 21:47:03 MSK;
>> 4min 16s ago
>>    Process: 1861 ExecStop=/usr/bin/kill -TERM $MAINPID (code=exited,
>> status=0/SUCCESS)
>>    Process: 2058 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid
>> watch (code=killed, signa
>>   Main PID: 1792 (code=exited, status=0/SUCCESS)
>>
>> дек 04 21:45:32 ha1 systemd[1]: Starting Shared-storage based fencing
>> daemon...
>> дек 04 21:47:02 ha1 systemd[1]: sbd.service: Start operation timed out.
>> Terminating.
>> дек 04 21:47:03 ha1 systemd[1]: Failed to start Shared-storage based
>> fencing daemon.
>> дек 04 21:47:03 ha1 systemd[1]: sbd.service: Unit entered failed state.
>> дек 04 21:47:03 ha1 systemd[1]: sbd.service: Failed with result 'timeout'.
>>
>> But the real problem is that, even though SBD failed to start, the whole
>> cluster stack continues to run; and because SBD blindly trusts that nodes
>> are well behaved, fencing appears to succeed after the timeout ... without
>> anyone taking any action on the poison pill ...
>>
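
For what it is worth, the log above shows the start job being terminated
exactly 90 seconds after it began (21:45:32 to 21:47:02), which matches
systemd's usual default start timeout. A minimal sketch of a drop-in that
gives sbd more time, assuming the delay imposed by SBD_DELAY_START=yes on
this cluster is longer than 90 seconds. The file name and the 180-second
value are placeholders, not taken from this thread:

```ini
# /etc/systemd/system/sbd.service.d/start-timeout.conf  (hypothetical file name)
[Service]
# Placeholder value: pick something larger than the configured start delay.
TimeoutStartSec=180
```

A "systemctl daemon-reload" is needed afterwards. This only addresses the
timeout itself; it does not change how dependent units react when sbd
fails to start.
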
> 
> That's an sbd bug. It declares itself as RequiredBy=corosync.service but
> orders itself Before=pacemaker.service. By systemd design, service A
> *MUST* have a Before dependency on service B if failure to start A should
> cause failure to start B. *Or* use BindsTo ... but that sounds wrong
> because it would cause B to start briefly and then be killed.
> 
> So the question is what is intended here. Should sbd.service be a
> prerequisite for corosync or pacemaker?
It should be so only if it's enabled. Try this:
https://github.com/ClusterLabs/sbd/pull/39

Thanks to Klaus, btw.

Regards,
   Yan

> Should failure to start SBD be
> fatal for startup of the dependent service? Finally, does sbd need an explicit
> dependency on pacemaker.service at all (in addition to corosync.service)?
> 
> Adding a Before dependency fixes the startup logic for me.
> 
> ha1:~ # systemctl start pacemaker.service
> A dependency job for pacemaker.service failed. See 'journalctl -xe' for
> details.
> ha1:~ # systemctl -l --no-pager status pacemaker.service
> ● pacemaker.service - Pacemaker High Availability Cluster Manager
>     Loaded: loaded (/etc/systemd/system/pacemaker.service; disabled;
> vendor preset: disabled)
>     Active: inactive (dead)
>       Docs: man:pacemakerd
> 
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/index.html
> 
> дек 16 18:56:06 ha1 systemd[1]: Dependency failed for Pacemaker High
> Availability Cluster Manager.
> дек 16 18:56:06 ha1 systemd[1]: pacemaker.service: Job
> pacemaker.service/start failed with result 'dependency'.
> ha1:~ # systemctl -l --no-pager status corosync.service
> ● corosync.service - Corosync Cluster Engine
>     Loaded: loaded (/usr/lib/systemd/system/corosync.service; static;
> vendor preset: disabled)
>     Active: inactive (dead)
>       Docs: man:corosync
>             man:corosync.conf
>             man:corosync_overview
> 
> дек 16 18:56:06 ha1 systemd[1]: Dependency failed for Corosync Cluster
> Engine.
> дек 16 18:56:06 ha1 systemd[1]: corosync.service: Job
> corosync.service/start failed with result 'dependency'.
> ha1:~ # systemctl -l --no-pager status sbd.service
> ● sbd.service - Shared-storage based fencing daemon
>     Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor
> preset: disabled)
>    Drop-In: /etc/systemd/system/sbd.service.d
>             └─before-corosync.conf
>     Active: failed (Result: timeout) since Sat 2017-12-16 18:56:06 MSK;
> 50s ago
>    Process: 3675 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid
> watch (code=killed, signal=TERM)
> 
> дек 16 18:54:36 ha1 systemd[1]: Starting Shared-storage based fencing
> daemon...
> дек 16 18:56:06 ha1 systemd[1]: sbd.service: Start operation timed out.
> Terminating.
> дек 16 18:56:06 ha1 systemd[1]: Failed to start Shared-storage based
> fencing daemon.
> дек 16 18:56:06 ha1 systemd[1]: sbd.service: Unit entered failed state.
> дек 16 18:56:06 ha1 systemd[1]: sbd.service: Failed with result 'timeout'.
> ha1:~ # cat /etc/systemd/system/sbd.service.d/before-corosync.conf
> [Unit]
> Before=corosync.service
> ha1:~ #
> 
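
The drop-in above adds the ordering from the sbd side. As a sketch of the
rule being discussed (a requirement only blocks the dependent unit when an
ordering dependency exists as well), the same pair can be written from the
dependent unit's side. The file name is hypothetical, and unlike
RequiredBy= in sbd's [Install] section, which takes effect only when
sbd.service is enabled, this form would apply unconditionally:

```ini
# /etc/systemd/system/corosync.service.d/require-sbd.conf  (hypothetical file name)
[Unit]
# Pull in sbd.service whenever corosync.service is started.
Requires=sbd.service
# Wait for sbd's start job to finish first; if it fails, corosync's start
# job fails with result 'dependency' instead of running in parallel.
After=sbd.service
```

With Requires= alone and no After=/Before=, systemd starts both units in
parallel, so a failed sbd start would not stop corosync from coming up.
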



