[ClusterLabs] Antw: Re: Antw: Re: Antw: Re: pacemaker with sbd fails to start if node reboots too fast.
Gao,Yan
ygao at suse.com
Tue Dec 5 11:46:06 EST 2017
On 12/05/2017 03:11 PM, Ulrich Windl wrote:
>
>
>>>> "Gao,Yan" <ygao at suse.com> wrote on 05.12.2017 at 15:04 in message
> <f3433dca-d654-0eac-80d6-2f92aeb3e894 at suse.com>:
>> On 12/05/2017 12:41 PM, Ulrich Windl wrote:
>>>
>>>
>>>>>> "Gao,Yan" <ygao at suse.com> wrote on 01.12.2017 at 20:36 in message
>>> <e49f3c0a-6981-3ab4-a0b0-1e5f49f34a25 at suse.com>:
>
> [...]
>>>
>>> I meant: There are three delays:
>>> 1) The delay until data is on the disk
>> It takes several IOs for the sender to do this: read the device
>> header, look up the slot, write the message, and verify that the
>> message is written. (timeout_io defaults to 3s.)
>>
>> As mentioned, the sender's msgwait timer starts only after the message
>> has been verified as written. We just need to make sure stonith-timeout
>> is configured sufficiently longer than the sum of that write time and
>> msgwait.
>>
>>> 2) Delay until data is read from the disk
>> It's already taken into account with msgwait. Since the recipient
>> keeps reading in a loop, we don't know exactly when it starts the read
>> for this specific message. But once it starts a read, the read has to
>> complete within timeout_watchdog, otherwise the watchdog triggers. So
>> even in a bad case, the message should be read within 2 *
>> timeout_watchdog. That's the reason why the sender has to wait msgwait,
>> which is 2 * timeout_watchdog.
>>
>>> 3) Delay until Host was killed
>> The kill is triggered basically immediately once the poison pill is read.
>
> Considering that the response time of a SAN disk system with cache is typically a very few microseconds, writing to disk may be even "more immediate" than killing the node via watchdog reset ;-)
Well, it's possible :) Timeouts matter for "bad cases" though. Compared
with disk IO facing difficulties such as path failures, triggering the
watchdog is trivial.
> So you can't easily say one is immediate, while the other has to be waited for IMHO.
Of course an even longer msgwait, with all the factors that you can think
of taken into account, will be even safer.
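
To make the arithmetic concrete, here is a rough sketch in Python of the
worst-case budget discussed in this thread. It is not taken from sbd itself.
The function name, the count of four sender IOs (read header, look up slot,
write, verify), the assumption that each of those IOs is bounded by
timeout_io, and the 20% safety margin are all illustrative assumptions; only
the 3s timeout_io default and msgwait = 2 * timeout_watchdog come from the
discussion above.

```python
def sbd_fencing_budget(timeout_watchdog, timeout_io=3.0, sender_ios=4, margin=1.2):
    """Estimate a worst-case fencing time budget, in seconds.

    Assumptions (hypothetical, for illustration only):
      - the sender performs `sender_ios` IOs, each bounded by `timeout_io`
      - `margin` is an arbitrary safety factor for stonith-timeout
    """
    # The sender waits msgwait, which is twice the watchdog timeout.
    msgwait = 2 * timeout_watchdog
    # Worst case for getting the poison pill verified on disk.
    write_phase = sender_ios * timeout_io
    # msgwait starts only after the write has been verified.
    worst_case = write_phase + msgwait
    return {
        "msgwait": msgwait,
        "write_phase": write_phase,
        "worst_case": worst_case,
        "min_stonith_timeout": worst_case * margin,
    }


# Example with a 5s watchdog timeout (an example value, not a recommendation).
budget = sbd_fencing_budget(timeout_watchdog=5)
for name, seconds in budget.items():
    print(f"{name}: {seconds:.1f}s")
```

The point is only that stonith-timeout has to cover both the write phase and
msgwait; the exact margin is a judgment call for the cluster at hand.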
Regards,
Yan
>
> Regards,
> Ulrich
>
>>
>>> A confirmation before 3) could shorten the total wait that includes 2) and
>>> 3), right?
>> As mentioned in another email, a live node, even one that has indeed come
>> back from the dead, cannot actually confirm its own death or even confirm
>> whether it was ever dead. And a successful fencing means the node is
>> dead.
>>
>> Regards,
>> Yan
>>
>>
>>>
>>> Regards,
>>> Ulrich
>>>
>>>
>>>>
>>>> Regards,
>>>> Yan
>>>>
> [...]
>