[ClusterLabs] Re: pacemaker with sbd fails to start if node reboots too fast.
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Tue Dec 5 12:41:29 CET 2017
>>> "Gao,Yan" <ygao at suse.com> schrieb am 01.12.2017 um 20:36 in Nachricht
<e49f3c0a-6981-3ab4-a0b0-1e5f49f34a25 at suse.com>:
> On 11/30/2017 06:48 PM, Andrei Borzenkov wrote:
>> On 30.11.2017 16:11, Klaus Wenninger wrote:
>>> On 11/30/2017 01:41 PM, Ulrich Windl wrote:
>>>>
>>>>>>> "Gao,Yan" <ygao at suse.com> schrieb am 30.11.2017 um 11:48 in Nachricht
>>>> <e71afccc-06e3-97dd-c66a-1b4bac550c23 at suse.com>:
>>>>> On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:
>>>>>> SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two-node cluster with
>>>>>> VMs on vSphere using a shared VMDK as SBD. During basic tests of killing
>>>>>> corosync and forcing STONITH, pacemaker was not started after reboot.
>>>>>> In the logs I see during boot:
>>>>>>
>>>>>> Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly
>>>>>> just fenced by sapprod01p for sapprod01p
>>>>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]: warning: The crmd
>>>>>> process (3151) can no longer be respawned,
>>>>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]: notice: Shutting down
>>>>>> Pacemaker
>>>>>> SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that
>>>>>> stonith with SBD always takes msgwait (at least, visually the host is
>>>>>> not declared OFFLINE until 120s have passed). But the VM reboots
>>>>>> lightning fast and is up and running long before the timeout expires.
>>>> As msgwait was intended for the message to arrive, and not for the reboot
>>>> time (I guess), this just shows a fundamental problem in SBD design:
>>>> receipt of the fencing command is not confirmed (other than by seeing the
>>>> consequences of its execution).
>>>
>>> The 2 x msgwait is not for confirmation but for writing the poison pill
>>> and for having it read by the target side.
>>
>> Yes, of course, but that's not what Ulrich likely intended to say.
>> msgwait must account for worst-case storage path latency, while in
>> normal cases the message arrives much faster. If the fenced node could
>> acknowledge having been killed after reboot, the stonith agent could
>> return success much earlier.
> How could a living man be sure he died before? ;)
I meant there are three delays:
1) the delay until the data is on the disk
2) the delay until the data is read from the disk
3) the delay until the host is killed
A confirmation before 3) could shorten the total wait that includes 2) and
3), right?
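
For reference, the timeouts in question live in the SBD header on the shared
disk and can be inspected with "sbd dump". A minimal sketch (the device path
is only an example; substitute the path of your shared VMDK):

    # Show the timeouts currently written to the SBD device header
    sbd -d /dev/disk/by-id/scsi-SBD_DEVICE dump
    # Typical output includes, among other fields:
    #   Timeout (watchdog) : 60
    #   Timeout (msgwait)  : 120

    # Re-initialize the header with explicit timeouts (this overwrites the
    # existing header!); -1 sets the watchdog timeout, -4 the msgwait timeout
    sbd -d /dev/disk/by-id/scsi-SBD_DEVICE -1 60 -4 120 create

The 60s/120s values above match the common recommendation of setting msgwait
to roughly twice the watchdog timeout, leaving time for the pill to be
written and read plus a full watchdog interval for the target to actually
die.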
Regards,
Ulrich
>
> Regards,
> Yan
>