[ClusterLabs] setting up SBD_WATCHDOG_TIMEOUT, stonith-timeout and stonith-watchdog-timeout

Jehan-Guillaume de Rorthais jgdr at dalibo.com
Wed Dec 14 07:26:40 EST 2016

On Thu, 8 Dec 2016 11:47:20 +0100
Jehan-Guillaume de Rorthais <jgdr at dalibo.com> wrote:

> Hello,
> While setting these various parameters, I couldn't find documentation or
> details about them. Below are some questions.
> Considering the watchdog module used on a server is set up with a 30s timer
> (let's call it the wdt, the "watchdog timer"), how should
> "SBD_WATCHDOG_TIMEOUT", "stonith-timeout" and "stonith-watchdog-timeout" be
> set?
> Here is my thinking so far:
> "SBD_WATCHDOG_TIMEOUT < wdt". The sbd daemon should reset the timer before the
> wdt expires so the server stays alive. Online resources and default values are
> usually "SBD_WATCHDOG_TIMEOUT=5s" and "wdt=30s". But what if sbd fails to
> reset the timer multiple times (e.g. because of excessive load, a swap storm,
> etc.)? The server will not reset before random*SBD_WATCHDOG_TIMEOUT or wdt,
> right?
> "stonith-watchdog-timeout > SBD_WATCHDOG_TIMEOUT". I'm not quite sure what
> stonith-watchdog-timeout is. Is it the maximum time stonithd waits, after
> asking for a node to be fenced, before it considers that the watchdog was
> actually triggered and the node reset, even with no confirmation? I suppose
> "stonith-watchdog-timeout" is mostly useful to stonithd, right?
> "stonith-watchdog-timeout < stonith-timeout". I understand the stonith action
> timeout should be greater than the wdt so stonithd will not raise a
> timeout before the wdt has had a chance to expire and reset the node. Is that
> right?

Does anyone have answers to these questions? I am currently writing some more
docs/cookbooks for the PAF project[1], and I would prefer to be sure of what is
written there :)
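
To make the orderings I am asking about concrete, here is a minimal Python
sketch. It only encodes my guesses from the quoted message, none of which is
confirmed yet. The function name check_timeouts and the 10s/60s values are
made up for illustration; only wdt=30s and SBD_WATCHDOG_TIMEOUT=5s come from
the usual defaults mentioned above.

```python
def check_timeouts(wdt, sbd_watchdog_timeout, stonith_watchdog_timeout, stonith_timeout):
    """Return the guessed orderings (all unconfirmed) that the given values violate.

    All arguments are durations in seconds.
    """
    guesses = [
        ("SBD_WATCHDOG_TIMEOUT < wdt", sbd_watchdog_timeout < wdt),
        ("stonith-watchdog-timeout > SBD_WATCHDOG_TIMEOUT",
         stonith_watchdog_timeout > sbd_watchdog_timeout),
        ("stonith-watchdog-timeout < stonith-timeout",
         stonith_watchdog_timeout < stonith_timeout),
        ("stonith-timeout > wdt", stonith_timeout > wdt),
    ]
    return [label for label, holds in guesses if not holds]


if __name__ == "__main__":
    # wdt=30 and SBD_WATCHDOG_TIMEOUT=5 are the usual values mentioned above;
    # 10 and 60 are hypothetical values picked only for this example.
    violated = check_timeouts(wdt=30, sbd_watchdog_timeout=5,
                              stonith_watchdog_timeout=10, stonith_timeout=60)
    print(violated or "all guessed orderings hold")
```

If the guesses turn out to be wrong, only the list of comparisons needs to
change; the rest of the check stays the same.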

[1] http://dalibo.github.io/PAF/documentation.html

