[ClusterLabs] Antw: Re: setting up SBD_WATCHDOG_TIMEOUT, stonith-timeout and stonith-watchdog-timeout

Fri Dec 9 14:51:42 UTC 2016

> But what if sbd fails to reset
> the timer multiple times (e.g. because of excessive load, swap storm, etc.)?

If I remember correctly, sbd locks its memory with mlock and runs under
SCHED_RR, so even when the server is swapping, sbd doesn't stop.
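
To illustrate the mechanism, here is a rough Linux-only sketch in Python. It is
not sbd's actual code (sbd is written in C). The function names are made up for
the example, and the MCL_* values are the ones used on x86 and ARM, which I am
assuming here. Both calls normally need root or the matching capabilities, so
the sketch reports failure instead of aborting.

```python
import ctypes
import os

# Assumed Linux values from <sys/mman.h> on x86 and ARM; other
# architectures may use different constants.
MCL_CURRENT = 1
MCL_FUTURE = 2


def lock_memory() -> bool:
    """Ask the kernel to keep all current and future pages in RAM."""
    libc = ctypes.CDLL(None, use_errno=True)
    return libc.mlockall(MCL_CURRENT | MCL_FUTURE) == 0


def use_round_robin() -> bool:
    """Switch the calling process to SCHED_RR at the lowest RR priority."""
    prio = os.sched_get_priority_min(os.SCHED_RR)
    try:
        os.sched_setscheduler(0, os.SCHED_RR, os.sched_param(prio))
    except OSError:
        # Typically EPERM when run without sufficient privileges.
        return False
    return True


def try_realtime_setup() -> dict:
    """Hypothetical helper: attempt both steps and report what succeeded."""
    return {"mlockall": lock_memory(), "sched_rr": use_round_robin()}


if __name__ == "__main__":
    print(try_realtime_setup())
```

Locked pages cannot be swapped out, and a SCHED_RR process is scheduled ahead
of all normal (SCHED_OTHER) processes, which is why load and swapping alone
should not keep such a process from running.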

2016-12-09 8:11 GMT+01:00 Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>:
>>>> emmanuel segura <emi2fast at gmail.com> wrote on 08.12.2016 at 14:37 in
> message
> <CAE7pJ3CSQyxQvBqLFvsFU=NLp95JWQdZvAP_6cLyBo5rSdCRng at mail.gmail.com>:
>> the only thing that I can say is: sbd is a realtime process
>
> Hi!
>
> You are saying it's scheduled with policy SCHED_RR and priority 0? A realtime process is more than its scheduling policy IMHO.
> What are you really trying to say?
>
> Regards,
> Ulrich
>
>>
>> 2016-12-08 11:47 GMT+01:00 Jehan-Guillaume de Rorthais <jgdr at dalibo.com>:
>>> Hello,
>>>
>>> While setting these various parameters, I couldn't find documentation and
>>> details about them. Below are some questions.
>>>
>>> Considering the watchdog module used on a server is set up with a 30s timer
>>> (let's call it the wdt, the "watchdog timer"), how should
>>> "SBD_WATCHDOG_TIMEOUT", "stonith-timeout" and "stonith-watchdog-timeout" be
>>> set?
>>>
>>> Here is my thinking so far:
>>>
>>> "SBD_WATCHDOG_TIMEOUT < wdt". The sbd daemon should reset the timer before the
>>> wdt expires so the server stays alive. Online resources and default values are
>>> usually "SBD_WATCHDOG_TIMEOUT=5s" and "wdt=30s". But what if sbd fails to reset
>>> the timer multiple times (e.g. because of excessive load, swap storm, etc.)? The
>>> server will not reset before random*SBD_WATCHDOG_TIMEOUT or wdt, right?
>>>
>>> "stonith-watchdog-timeout > SBD_WATCHDOG_TIMEOUT". I'm not quite sure what
>>> stonith-watchdog-timeout is. Is it the maximum time stonithd waits after it
>>> asked for a node fencing before it considers that the watchdog was actually
>>> triggered and the node reset, even with no confirmation? I suppose
>>> "stonith-watchdog-timeout" is mostly useful to stonithd, right?
>>>
>>> "stonith-watchdog-timeout < stonith-timeout". I understand the stonith action
>>> timeout should be at least greater than the wdt so stonithd will not raise a
>>> timeout before the wdt had a chance to expire and reset the node. Is that
>>> right?
>>>
>>> Any other comments?
>>>
>>> Regards,
>>> --
>>> Jehan-Guillaume de Rorthais
>>> Dalibo
>>>
>>> _______________________________________________
>>> Users mailing list: Users at clusterlabs.org
>>> http://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org

-- 
  .~.
  /V\
 //  \\
/(   )\
^`~'^