[ClusterLabs] Wrong sbd.service dependencies (was: Re: pacemaker with sbd fails to start if node reboots too fast)
Andrei Borzenkov
arvidjaar at gmail.com
Sat Dec 16 10:59:41 EST 2017
04.12.2017 21:55, Andrei Borzenkov wrote:
...
>>>
>>> I tried it (on openSUSE Tumbleweed which is what I have at hand, it has
>>> SBD 1.3.0) and with SBD_DELAY_START=yes sbd does not appear to watch
>>> disk at all.
>> It simply waits that long on startup before starting the rest of the
>> cluster stack to make sure the fencing that targeted it has returned. It
>> intentionally doesn't watch anything during this period of time.
>>
>
> Unfortunately it waits too long.
>
> ha1:~ # systemctl status sbd.service
> ● sbd.service - Shared-storage based fencing daemon
> Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor
> preset: disabled)
> Active: failed (Result: timeout) since Mon 2017-12-04 21:47:03 MSK;
> 4min 16s ago
> Process: 1861 ExecStop=/usr/bin/kill -TERM $MAINPID (code=exited,
> status=0/SUCCESS)
> Process: 2058 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid
> watch (code=killed, signa
> Main PID: 1792 (code=exited, status=0/SUCCESS)
>
> Dec 04 21:45:32 ha1 systemd[1]: Starting Shared-storage based fencing
> daemon...
> Dec 04 21:47:02 ha1 systemd[1]: sbd.service: Start operation timed out.
> Terminating.
> Dec 04 21:47:03 ha1 systemd[1]: Failed to start Shared-storage based
> fencing daemon.
> Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Unit entered failed state.
> Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Failed with result 'timeout'.
>
> But the real problem is that, even though SBD failed to start, the
> whole cluster stack continues to run; and because SBD blindly trusts
> that nodes are well behaved, fencing appears to succeed after the
> timeout ... without anyone taking any action on the poison pill ...
>
That's an sbd bug. sbd.service declares itself as
RequiredBy=corosync.service but orders itself only
Before=pacemaker.service. By systemd design, service A *MUST* have a
Before dependency on service B if failure to start A should cause
failure to start B. Without that ordering, B is started in parallel
with A and is not affected when A later fails. *Or* use BindsTo ... but
that sounds wrong because it would cause B to start briefly and then be
killed.
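
Whether the ordering is actually in place can be checked from the unit
properties. A minimal sketch, assuming `systemctl show -p Before` prints a
single `Before=` line with space-separated unit names; the helper name
is_ordered_before is hypothetical and not part of sbd or systemd:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the unit named in $1 appears in a
# "Before=" line read from stdin (the format printed by "systemctl show").
is_ordered_before() {
    target=$1
    while IFS= read -r line; do
        case $line in
            Before=*)
                # Intentionally unquoted: split the value into unit names.
                for u in ${line#Before=}; do
                    if [ "$u" = "$target" ]; then
                        return 0
                    fi
                done
                ;;
        esac
    done
    return 1
}
```

On the affected node this would be used as
`systemctl show -p Before sbd.service | is_ordered_before corosync.service`,
which is expected to fail with the packaged unit described here.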
So the question is what is intended here. Should sbd.service be a
prerequisite for corosync or for pacemaker? Should failure to start SBD
be fatal for startup of the dependent service? Finally, does sbd need an
explicit dependency on pacemaker.service at all (in addition to
corosync.service)?
Adding a Before dependency on corosync.service fixes the startup logic
for me.
ha1:~ # systemctl start pacemaker.service
A dependency job for pacemaker.service failed. See 'journalctl -xe' for
details.
ha1:~ # systemctl -l --no-pager status pacemaker.service
● pacemaker.service - Pacemaker High Availability Cluster Manager
Loaded: loaded (/etc/systemd/system/pacemaker.service; disabled;
vendor preset: disabled)
Active: inactive (dead)
Docs: man:pacemakerd
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/index.html
Dec 16 18:56:06 ha1 systemd[1]: Dependency failed for Pacemaker High
Availability Cluster Manager.
Dec 16 18:56:06 ha1 systemd[1]: pacemaker.service: Job
pacemaker.service/start failed with result 'dependency'.
ha1:~ # systemctl -l --no-pager status corosync.service
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/usr/lib/systemd/system/corosync.service; static;
vendor preset: disabled)
Active: inactive (dead)
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Dec 16 18:56:06 ha1 systemd[1]: Dependency failed for Corosync Cluster
Engine.
Dec 16 18:56:06 ha1 systemd[1]: corosync.service: Job
corosync.service/start failed with result 'dependency'.
ha1:~ # systemctl -l --no-pager status sbd.service
● sbd.service - Shared-storage based fencing daemon
Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor
preset: disabled)
Drop-In: /etc/systemd/system/sbd.service.d
└─before-corosync.conf
Active: failed (Result: timeout) since Sat 2017-12-16 18:56:06 MSK;
50s ago
Process: 3675 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid
watch (code=killed, signal=TERM)
Dec 16 18:54:36 ha1 systemd[1]: Starting Shared-storage based fencing
daemon...
Dec 16 18:56:06 ha1 systemd[1]: sbd.service: Start operation timed out.
Terminating.
Dec 16 18:56:06 ha1 systemd[1]: Failed to start Shared-storage based
fencing daemon.
Dec 16 18:56:06 ha1 systemd[1]: sbd.service: Unit entered failed state.
Dec 16 18:56:06 ha1 systemd[1]: sbd.service: Failed with result 'timeout'.
ha1:~ # cat /etc/systemd/system/sbd.service.d/before-corosync.conf
[Unit]
Before=corosync.service
ha1:~ #
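
For anyone who wants to reproduce the workaround, the drop-in above can be
created along these lines. This is only a sketch: the helper name
write_before_dropin is hypothetical, and the optional directory argument
exists just so the function can be tried outside /etc/systemd/system.

```shell
#!/bin/sh
# Hypothetical helper: write the Before=corosync.service drop-in for
# sbd.service. $1 is the systemd unit directory (default: /etc/systemd/system).
write_before_dropin() {
    unit_dir=${1:-/etc/systemd/system}
    dropin_dir=$unit_dir/sbd.service.d
    mkdir -p "$dropin_dir" || return 1
    # Same two lines as the before-corosync.conf shown in the transcript.
    printf '[Unit]\nBefore=corosync.service\n' \
        > "$dropin_dir/before-corosync.conf"
}
```

After writing the file as root, `systemctl daemon-reload` is needed so that
systemd picks up the new drop-in.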