[ClusterLabs] sub-second error detection and failover possible

Tue Sep 1 15:07:29 UTC 2015

On 01/09/15 09:27 AM, Michael Schwartzkopff wrote:
> Hi,
> 
> perhaps this question was answered elsewhere, but I could not find any 
> satisfying answer. So is it possible to set up a corosync/pacemaker cluster 
> that detects errors and does the failover in a sub-second time span?
> 
> If yes, how?
> 
> 
> Kind regards,
> 
> Michael Schwartzkopff

Corosync is what declares a node lost, so you would need to start by
tuning it (token loss timeout and loss count). Of course, as you tighten
this up, the chance of a transient issue causing a false declaration of
node loss increases.

Next, you'd need a fence device that can terminate and verify the node's
termination very, very quickly. I do not know of such a device. Part of
this is also the time taken for the fence agent to be invoked.

Last, you'd need to have pacemaker calculate the new desired state and
make those changes. The services being recovered would need to start
exceptionally quickly.
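
To make the point concrete, here is a rough sketch of how the three
stages above add up. The function names and every number in it are
hypothetical placeholders for illustration, not measurements from any
real cluster or fence device.

```python
# Rough failover time budget: the stages run one after another, so
# their durations add up. All values are in milliseconds.

def failover_budget_ms(detect_ms, fence_ms, recover_ms):
    """Total time from node failure to services running again."""
    return detect_ms + fence_ms + recover_ms


def fits_subsecond(detect_ms, fence_ms, recover_ms, limit_ms=1000):
    """True if the whole failover completes in under limit_ms."""
    return failover_budget_ms(detect_ms, fence_ms, recover_ms) < limit_ms


# Made-up example values (assumptions, not benchmarks):
#   300 ms for corosync to declare the node lost,
#   2000 ms for the fence agent to run and confirm termination,
#   500 ms for pacemaker to compute the new state and start services.
total = failover_budget_ms(300, 2000, 500)
print(total, fits_subsecond(300, 2000, 500))
```

With placeholder values like these, the fence step alone exceeds the
one-second limit, so tuning detection by itself does not get you there.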

In theory, it's possible I suppose. In practice, very unlikely.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?