[ClusterLabs] pacemaker with sbd fails to start if node reboots too fast.
Andrei Borzenkov
arvidjaar at gmail.com
Mon Dec 4 13:55:46 EST 2017
04.12.2017 14:48, Gao,Yan wrote:
> On 12/02/2017 07:19 PM, Andrei Borzenkov wrote:
>> 30.11.2017 13:48, Gao,Yan wrote:
>>> On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:
>>>> SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two node cluster with
>>>> VM on VSphere using shared VMDK as SBD. During basic tests by killing
>>>> corosync and forcing STONITH pacemaker was not started after reboot.
>>>> In logs I see during boot
>>>>
>>>> Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly
>>>> just fenced by sapprod01p for sapprod01p
>>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]: warning: The crmd
>>>> process (3151) can no longer be respawned,
>>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]: notice: Shutting down
>>>> Pacemaker
>>>>
>>>> SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that
>>>> stonith with SBD always takes msgwait (at least, visually the host is
>>>> not declared OFFLINE until 120s have passed). But the VM reboots
>>>> lightning-fast and is up and running long before the timeout expires.
>>>>
>>>> I think I have seen similar report already. Is it something that can
>>>> be fixed by SBD/pacemaker tuning?
>>> SBD_DELAY_START=yes in /etc/sysconfig/sbd is the solution.
>>>
>>
>> I tried it (on openSUSE Tumbleweed which is what I have at hand, it has
>> SBD 1.3.0) and with SBD_DELAY_START=yes sbd does not appear to watch
>> disk at all.
> It simply waits that long on startup before starting the rest of the
> cluster stack to make sure the fencing that targeted it has returned. It
> intentionally doesn't watch anything during this period of time.
>
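For reference, the switch under discussion is a single line in /etc/sysconfig/sbd (a minimal sketch; the rest of the file is left as-is):

```shell
# /etc/sysconfig/sbd (fragment)
# Delay starting the rest of the cluster stack after boot, so that a
# fencing request that targeted this node has time to complete first.
SBD_DELAY_START=yes
```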
Unfortunately, it waits too long:
ha1:~ # systemctl status sbd.service
● sbd.service - Shared-storage based fencing daemon
Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor
preset: disabled)
Active: failed (Result: timeout) since Mon 2017-12-04 21:47:03 MSK;
4min 16s ago
Process: 1861 ExecStop=/usr/bin/kill -TERM $MAINPID (code=exited,
status=0/SUCCESS)
Process: 2058 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid
watch (code=killed, signa
Main PID: 1792 (code=exited, status=0/SUCCESS)
Dec 04 21:45:32 ha1 systemd[1]: Starting Shared-storage based fencing
daemon...
Dec 04 21:47:02 ha1 systemd[1]: sbd.service: Start operation timed out.
Terminating.
Dec 04 21:47:03 ha1 systemd[1]: Failed to start Shared-storage based
fencing daemon.
Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Unit entered failed state.
Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Failed with result 'timeout'.
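With SBD_DELAY_START=yes the daemon deliberately sleeps longer than systemd's default start timeout (typically 90s), so systemd kills it before it ever comes up. One conceivable workaround is a drop-in raising the unit's start timeout above msgwait (a hypothetical sketch for the 120s msgwait above; the file name is illustrative):

```ini
# /etc/systemd/system/sbd.service.d/50-delay-start.conf (hypothetical drop-in)
[Service]
# Must exceed msgwait (120s here) so the delayed start is not killed.
TimeoutStartSec=180
```

followed by `systemctl daemon-reload` before the next start attempt.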
But the real problem is that despite SBD having failed to start, the
whole cluster stack continues to run; and because SBD blindly trusts
that nodes behave well, fencing appears to succeed after the timeout ...
without anyone acting on the poison pill ...
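The "success" here is purely clock-based. With the values from the beginning of the thread (a sketch, assuming the usual rule of thumb that msgwait is twice the watchdog timeout):

```shell
# Timeouts reported earlier in the thread (assumed values).
watchdog=60
msgwait=$((2 * watchdog))   # 120s, matching the reported msgwait
# The fencing node simply waits msgwait and then reports success,
# whether or not the target ever read the poison pill from disk.
echo "fencing reported successful after ${msgwait}s"
```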
ha1:~ # systemctl show sbd.service -p RequiredBy
RequiredBy=corosync.service
but
ha1:~ # systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/usr/lib/systemd/system/corosync.service; static;
vendor preset: disabled)
Active: active (running) since Mon 2017-12-04 21:45:33 MSK; 7min ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Process: 1860 ExecStop=/usr/share/corosync/corosync stop (code=exited,
status=0/SUCCESS)
Process: 2059 ExecStart=/usr/share/corosync/corosync start
(code=exited, status=0/SUCCESS)
Main PID: 2073 (corosync)
Tasks: 2 (limit: 4915)
CGroup: /system.slice/corosync.service
└─2073 corosync
and
ha1:~ # crm_mon -1r
Stack: corosync
Current DC: ha1 (version 1.1.17-3.3-36d2962a8) - partition with quorum
Last updated: Mon Dec 4 21:53:24 2017
Last change: Mon Dec 4 21:47:25 2017 by hacluster via crmd on ha1
2 nodes configured
1 resource configured
Online: [ ha1 ha2 ]
Full list of resources:
stonith-sbd (stonith:external/sbd): Started ha1
and if I now sever the connection between the two nodes, I will get two
single-node clusters, each believing it won ...