[ClusterLabs] Wrong sbd.service dependencies

Sun Dec 17 18:52:20 UTC 2017

On 12/17/2017 06:10 PM, Andrei Borzenkov wrote:
> 17.12.2017 15:20, Gao,Yan wrote:
>> On 2017/12/16 16:59, Andrei Borzenkov wrote:
>>> 04.12.2017 21:55, Andrei Borzenkov wrote:
>>> ...
>>>>>> I tried it (on openSUSE Tumbleweed, which is what I have at hand; it
>>>>>> has SBD 1.3.0) and with SBD_DELAY_START=yes sbd does not appear to
>>>>>> watch the disk at all.
>>>>> It simply waits that long on startup before starting the rest of the
>>>>> cluster stack to make sure the fencing that targeted it has returned.
>>>>> It intentionally doesn't watch anything during this period of time.
>>>>>
>>>> Unfortunately it waits too long.
>>>>
>>>> ha1:~ # systemctl status sbd.service
>>>> ● sbd.service - Shared-storage based fencing daemon
>>>>     Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor preset: disabled)
>>>>     Active: failed (Result: timeout) since Mon 2017-12-04 21:47:03 MSK; 4min 16s ago
>>>>    Process: 1861 ExecStop=/usr/bin/kill -TERM $MAINPID (code=exited, status=0/SUCCESS)
>>>>    Process: 2058 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid watch (code=killed, signa
>>>>   Main PID: 1792 (code=exited, status=0/SUCCESS)
>>>>
>>>> Dec 04 21:45:32 ha1 systemd[1]: Starting Shared-storage based fencing daemon...
>>>> Dec 04 21:47:02 ha1 systemd[1]: sbd.service: Start operation timed out. Terminating.
>>>> Dec 04 21:47:03 ha1 systemd[1]: Failed to start Shared-storage based fencing daemon.
>>>> Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Unit entered failed state.
>>>> Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Failed with result 'timeout'.
>>>>
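In the journal above, 90 seconds pass between "Starting" (21:45:32) and "Start operation timed out" (21:47:02), which would be consistent with systemd's default start timeout of 90 s being shorter than the delay sbd waits with SBD_DELAY_START=yes. A minimal sketch of a drop-in that raises the timeout follows; the file name and the value are hypothetical placeholders, not something taken from this thread or from the shipped unit, and it does nothing about the ordering problem described next.

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/sbd.service.d/timeout.conf
# Assumption: the configured sbd start delay exceeds the start timeout in effect.
[Service]
# Placeholder value; choose something larger than the delay sbd waits on startup.
TimeoutStartSec=180
```

A drop-in like this is only picked up after `systemctl daemon-reload`.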
>>>> But the real problem is that even though SBD failed to start, the whole
>>>> cluster stack continues to run; and because SBD blindly trusts nodes to
>>>> be well behaved, fencing appears to succeed after the timeout ... without
>>>> anyone taking any action on the poison pill ...
>>>>
>>> That's an sbd bug. It declares itself as RequiredBy=corosync.service but
>>> orders itself Before=pacemaker.service. Due to systemd design, service A
>>> *MUST* have a Before dependency on service B if failure to start A should
>>> cause failure to start B. *Or* use BindsTo ... but that sounds wrong
>>> because it would cause B to start briefly and then be killed.
>>>
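To make the ordering point concrete: in systemd, a requirement dependency (Requires=, which RequiredBy= installs on the other unit) and an ordering dependency (Before=/After=) are independent, and a failed start of A only prevents B from starting when both are present. A minimal sketch follows, assuming sbd is meant to gate corosync rather than pacemaker, which is exactly the open question in this thread; the drop-in file name is hypothetical.

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/sbd.service.d/ordering.conf
# Assumption: sbd should be a prerequisite for corosync, and the shipped unit
# already declares RequiredBy=corosync.service in its [Install] section.
[Unit]
# Order sbd ahead of corosync so that a failed sbd start also fails the
# corosync start job, instead of both units starting in parallel.
Before=corosync.service
```

The requirement half only exists while sbd.service is enabled, since RequiredBy= takes effect through `systemctl enable`.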
>>> So the question is what is intended here. Should sbd.service be a
>>> prerequisite for corosync or pacemaker?
>> It should be so only if it's enabled. Try this:
>> https://github.com/ClusterLabs/sbd/pull/39
>>
> This is wrong; I commented on this pull request.

Not sure if it is that simple ... I added an additional comment to the PR.
