[ClusterLabs] sub-second error detection and failover possible

Tue Sep 1 15:20:05 UTC 2015

On 09/01/2015 10:07 AM, Digimer wrote:
> On 01/09/15 09:27 AM, Michael Schwartzkopff wrote:
>> Hi,
>>
>> perhaps this question was answered elsewhere, but I could not find any 
>> satisfactory answer. So is it possible to set up a corosync/pacemaker cluster 
>> that detects errors and does the failover in a sub-second time span?
>>
>> If yes, how?
>>
>>
>> Kind regards,
>>
>> Michael Schwartzkopff
> 
> Corosync is what declares the loss of a node, so you would need to start
> by tuning it (token loss timeout and loss count). Of course, as you
> tighten this up, the chance of a transient issue causing a false
> declaration of node loss increases.
> 
> Next, you'd need a fence device that can terminate and verify the node's
> termination very, very quickly. I do not know of such a device. Part of
> this is also the time taken for the fence agent to be invoked.
> 
> Last, you'd need to have pacemaker calculate the new desired state and
> make those changes. The services being recovered would need to start
> exceptionally quickly.
> 
> In theory, it's possible I suppose. In practice, very unlikely.

Another consideration: while pacemaker timeouts and intervals can be
specified in milliseconds, internally pacemaker frequently truncates
such values to whole seconds. I wouldn't recommend using anything less
than 2s in any configured value.
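
To make the two points above concrete, here is a rough Python sketch. It is
not pacemaker code: the function names are hypothetical, the truncation is
modeled as plain integer division, and every timing in the budget is an
assumed figure rather than a measurement. It only shows what whole-second
truncation does to a sub-second setting, and how the stages Digimer listed
(node-loss detection, fencing, state calculation, service start) add up.

```python
# Illustrative sketch only; names and numbers are assumptions, not pacemaker internals.

def truncate_ms_to_s(ms: int) -> int:
    """Model a millisecond value being truncated to whole seconds."""
    return ms // 1000


def failover_budget_ms(token_ms: int, fence_ms: int,
                       transition_ms: int, start_ms: int) -> int:
    """Sum the sequential stages of a failover, all in milliseconds."""
    return token_ms + fence_ms + transition_ms + start_ms


if __name__ == "__main__":
    # A configured value below 1000 ms collapses to 0 s under this model.
    for configured in (500, 999, 1500, 2000):
        print(f"{configured} ms -> {truncate_ms_to_s(configured)} s")

    # Hypothetical, optimistic per-stage timings.
    total = failover_budget_ms(token_ms=200, fence_ms=500,
                               transition_ms=100, start_ms=300)
    print(f"total failover time: {total} ms")
```

Even with these optimistic assumed figures the total is already above one
second, and that is before any value gets truncated.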