[ClusterLabs] Antw: Re: Antw: Re: pacemaker with sbd fails to start if node reboots too fast.

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Tue Dec 5 11:41:29 UTC 2017



>>> "Gao,Yan" <ygao at suse.com> wrote on 01.12.2017 at 20:36 in message
<e49f3c0a-6981-3ab4-a0b0-1e5f49f34a25 at suse.com>:
> On 11/30/2017 06:48 PM, Andrei Borzenkov wrote:
>> 30.11.2017 16:11, Klaus Wenninger wrote:
>>> On 11/30/2017 01:41 PM, Ulrich Windl wrote:
>>>>
>>>>>>> "Gao,Yan" <ygao at suse.com> wrote on 30.11.2017 at 11:48 in message
>>>> <e71afccc-06e3-97dd-c66a-1b4bac550c23 at suse.com>:
>>>>> On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:
>>>>>> SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two node cluster with
>>>>>> VM on VSphere using shared VMDK as SBD. During basic tests by killing
>>>>>> corosync and forcing STONITH pacemaker was not started after reboot.
>>>>>> In logs I see during boot
>>>>>>
>>>>>> Nov 22 16:04:56 sapprod01s crmd[3151]:     crit: We were allegedly
>>>>>> just fenced by sapprod01p for sapprod01p
>>>>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]:  warning: The crmd
>>>>>> process (3151) can no longer be respawned,
>>>>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]:   notice: Shutting down Pacemaker
>>>>>> SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that
>>>>>> stonith with SBD always takes msgwait (at least, visually host is not
>>>>>> declared as OFFLINE until 120s passed). But VM reboots lightning fast
>>>>>> and is up and running long before timeout expires.
>>>> As msgwait was intended for the message to arrive, and not for the reboot
>>>> time (I guess), this just shows a fundamental problem in SBD design: Receipt
>>>> of the fencing command is not confirmed (other than by seeing the
>>>> consequences of its execution).
>>>
>>> The 2 x msgwait is not for confirmations but for writing the poison-pill
>>> and for having it read by the target side.
>> 
>> Yes, of course, but that's not what Ulrich likely intended to say.
>> msgwait must account for worst case storage path latency, while in
>> normal cases it happens much faster. If fenced node could acknowledge
>> having been killed after reboot, stonith agent could return success much
>> earlier.
> How could an alive man be sure he died before? ;)

I meant: There are three delays:
1) The delay until the data is on the disk
2) The delay until the data is read from the disk
3) The delay until the host is killed

A confirmation before 3) could shorten the total wait that includes 2) and 3),
right?
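
To make the arithmetic concrete, here is a rough sketch of the two waits.
Everything in it is an assumption for illustration: the function names, the
delay values, and in particular `kill_bound` (some upper bound on delay 3
that the fencing side would still have to honor after a confirmation).
Only the 120s msgwait comes from this thread. It is not how sbd works today.

```python
# Hypothetical timing model of the idea above; not sbd's actual behavior.

def wait_without_ack(msgwait: float) -> float:
    """Today: the fencing side always waits the full worst-case msgwait."""
    return msgwait


def wait_with_ack(write_delay: float, read_delay: float, ack_delay: float,
                  kill_bound: float, msgwait: float) -> float:
    """With a receipt confirmation (assumed feature).

    The fencing side waits for 1) the write, 2) the read, and the
    confirmation itself, then for an assumed bound on 3) the kill.
    It never waits longer than msgwait.
    """
    confirmed_at = write_delay + read_delay + ack_delay
    return min(confirmed_at + kill_bound, msgwait)


if __name__ == "__main__":
    msgwait = 120.0  # msgwait from the original report
    # Made-up "normal case" delays of one second each, kill bound of 5s.
    print(wait_without_ack(msgwait))
    print(wait_with_ack(1.0, 1.0, 1.0, 5.0, msgwait))
```

With slow storage the sum simply hits the msgwait cap, so the worst case
would be no worse than now.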

Regards,
Ulrich


> 
> Regards,
>    Yan
> 
>> 
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org 
>> http://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>> Project Home: http://www.clusterlabs.org 
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>> Bugs: http://bugs.clusterlabs.org 
>> 
> 



