[ClusterLabs] sub-second error detection and failover possible

Ken Gaillot kgaillot at redhat.com
Tue Sep 1 11:20:05 EDT 2015

On 09/01/2015 10:07 AM, Digimer wrote:
> On 01/09/15 09:27 AM, Michael Schwartzkopff wrote:
>> Hi,
>> perhaps this question was answered elsewhere, but I could not find any 
>> satisfying answer. So is it possible to set up a corosync/pacemaker cluster 
>> that detects errors and does the failover in a sub-second time span?
>> If yes, how?
>> Best regards,
>> Michael Schwartzkopff
> Corosync declares the loss of a node, so you would need to start by tuning
> it (token loss timeout and loss count). Of course, as you tighten this
> up, the chances of a transient issue causing a false declaration of node
> loss increase.
> Next, you'd need a fence device that can terminate the node and verify its
> termination very, very quickly. I do not know of such a device. Part of
> this delay is also the time it takes to invoke the fence agent.
> Last, you'd need to have pacemaker calculate the new desired state and
> make those changes. The services being recovered would need to start
> exceptionally quickly.
> In theory, I suppose it's possible. In practice, it's very unlikely.
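To make the point concrete, here is a rough failover-budget sketch that adds up the phases listed above: corosync detection, fencing, pacemaker's transition, and service start. It assumes detection takes roughly the totem `token` timeout plus the `consensus` timeout, with `consensus` defaulting to 1.2 x `token` (the minimum corosync.conf documents). The names `corosync_detection_ms` and `FailoverBudget` are hypothetical, and the figures in the example are placeholders for illustration, not measurements of any real fence device or service.

```python
from dataclasses import dataclass
from typing import Optional


def corosync_detection_ms(token_ms: int, consensus_ms: Optional[int] = None) -> int:
    """Rough time until corosync declares a node lost.

    Assumption: detection ~= token timeout + consensus timeout, where
    consensus must be at least 1.2 * token. Integer ceiling arithmetic
    avoids float rounding when computing 1.2 * token.
    """
    minimum = (token_ms * 12 + 9) // 10  # ceil(1.2 * token_ms)
    if consensus_ms is None:
        consensus_ms = minimum
    if consensus_ms < minimum:
        raise ValueError("consensus must be at least 1.2 * token")
    return token_ms + consensus_ms


@dataclass
class FailoverBudget:
    detection_ms: int      # corosync declares the node lost
    fencing_ms: int        # fence agent invoked, node killed and confirmed dead
    transition_ms: int     # pacemaker computes and applies the new state
    service_start_ms: int  # recovered services start on a surviving node

    def total_ms(self) -> int:
        return (self.detection_ms + self.fencing_ms
                + self.transition_ms + self.service_start_ms)

    def is_sub_second(self) -> bool:
        return self.total_ms() < 1000


# Placeholder figures for illustration only, not measurements.
budget = FailoverBudget(
    detection_ms=corosync_detection_ms(300),  # 300 + 360 = 660 ms
    fencing_ms=250,
    transition_ms=100,
    service_start_ms=150,
)
print(budget.total_ms(), budget.is_sub_second())  # 1160 False
```

Even with an aggressive 300 ms token timeout, detection alone consumes about two thirds of a one-second budget in this model, which leaves very little room for fencing and recovery.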

Another consideration: while pacemaker timeouts and intervals can be
specified in milliseconds, internally pacemaker frequently truncates
such values to whole seconds. I wouldn't recommend using anything less
than 2s in any configured value.
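A small sketch of why sub-second values are risky if they end up truncated to whole seconds. This is a hypothetical model (simply dropping the sub-second part), not pacemaker's actual code: a 500 ms value collapses to 0 s, and 1500 ms loses a third of its length, while whole-second values of 2 s or more are unaffected.

```python
def effective_seconds(ms: int) -> int:
    # Hypothetical truncation model: keep whole seconds only.
    return ms // 1000


def relative_loss(ms: int) -> float:
    # Fraction of the configured value lost to truncation.
    return (ms - effective_seconds(ms) * 1000) / ms


for ms in (500, 1500, 2000, 2500):
    print(f"{ms} ms -> {effective_seconds(ms)} s ({relative_loss(ms):.0%} lost)")
```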
