[ClusterLabs] pacemaker with sbd fails to start if node reboots too fast.

Thu Nov 30 06:50:37 EST 2017

On Thu, Nov 30, 2017 at 1:48 PM, Gao,Yan <ygao at suse.com> wrote:
> On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:
>>
>> SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two-node cluster with
>> VMs on vSphere using a shared VMDK as SBD. During basic tests (killing
>> corosync and forcing STONITH), pacemaker was not started after reboot.
>> In the logs I see during boot:
>>
>> Nov 22 16:04:56 sapprod01s crmd[3151]:     crit: We were allegedly
>> just fenced by sapprod01p for sapprod01p
>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]:  warning: The crmd
>> process (3151) can no longer be respawned,
>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]:   notice: Shutting down
>> Pacemaker
>>
>> SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that
>> stonith with SBD always takes msgwait (at least, visually host is not
>> declared as OFFLINE until 120s passed). But VM reboots lightning fast
>> and is up and running long before timeout expires.
>>
>> I think I have seen similar report already. Is it something that can
>> be fixed by SBD/pacemaker tuning?
>
> SBD_DELAY_START=yes in /etc/sysconfig/sbd is the solution.
>

Sounds promising. Is it enough? The comment in /etc/sysconfig/sbd says
"Whether to delay after starting sbd on boot for "msgwait" seconds.",
but as I understand it, the stonith agent timeout is 2 * msgwait.
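
To make the question concrete, here is a rough sketch of the arithmetic
in Python. It assumes that SBD_DELAY_START=yes delays startup by exactly
msgwait seconds (my reading of the sysconfig comment) and that the
fencing side may wait up to 2 * msgwait (my understanding, not
confirmed). The function name and constants are made up for
illustration, and VM boot time is ignored because the VM reboots almost
instantly.

```python
# Timeouts configured on this cluster, in seconds.
WATCHDOG_TIMEOUT = 60
MSGWAIT = 120


def startup_delay_covers_fencing(boot_delay, fencing_wait):
    """Return True if the rebooted node waits at least as long as fencing may take.

    boot_delay:   seconds the node waits before starting the cluster stack
    fencing_wait: seconds the surviving node may need to consider fencing done
    """
    return boot_delay >= fencing_wait


# Assumption: SBD_DELAY_START=yes delays startup by msgwait seconds.
delay = MSGWAIT

# Case 1: fencing is considered complete after msgwait (what I see visually).
print(startup_delay_covers_fencing(delay, MSGWAIT))

# Case 2: fencing may take up to 2 * msgwait (my reading of the stonith timeout).
print(startup_delay_covers_fencing(delay, 2 * MSGWAIT))
```

If the second case can really happen, a delay of msgwait alone would
leave a window in which the node is back up before fencing is reported
as complete, hence the question.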