[ClusterLabs] [Linux-ha-dev] Hello, I have a question about Heartbeat. Can somebody give me a reply? Thank you

Fri Oct 16 13:31:13 UTC 2015

You may prefer to use the ClusterLabs mailing list; this list is being
phased out.

On 16/10/15 05:20 AM, Shilu wrote:
> When the master is down, can the time to switch from backup to master
> be shorter than a second?

Not safely, no.

In HA, if a node is declared dead, it needs to be fenced/stonith'ed
before its services are recovered. Not doing this can lead to a
split-brain. The process of fencing a node takes time; exactly how much
depends on the device or method you are using. IPMI fencing, one of the
most common types, takes a few seconds.
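To make that ordering concrete, here is a minimal Python sketch (not Pacemaker's actual logic) of a failover path that refuses to recover services until fencing is confirmed. The `fence` callable, the node name and every timing value are hypothetical inputs; the point is only that total failover time is detection + fencing + service start, so a fence device that needs a few seconds rules out sub-second failover on its own.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FailoverResult:
    recovered: bool
    elapsed_s: float
    steps: list[str] = field(default_factory=list)


def fail_over(node: str,
              detect_s: float,
              fence: Callable[[str], tuple[bool, float]],
              start_services_s: float) -> FailoverResult:
    """Recover a dead node's services only after fencing is confirmed.

    `fence` is a hypothetical callable returning (success, seconds_taken);
    all timings are illustrative inputs, not measurements.
    """
    elapsed = detect_s
    steps = [f"declared {node} dead after {detect_s:.1f}s"]

    ok, fence_s = fence(node)
    elapsed += fence_s
    if not ok:
        # Fencing failed: the node may still be running, so block recovery
        # rather than risk two nodes running the same services (split-brain).
        steps.append(f"fence of {node} FAILED; recovery blocked")
        return FailoverResult(False, elapsed, steps)
    steps.append(f"fenced {node} in {fence_s:.1f}s")

    # Only now is it safe to start the services elsewhere.
    elapsed += start_services_s
    steps.append(f"services started in {start_services_s:.1f}s")
    return FailoverResult(True, elapsed, steps)


if __name__ == "__main__":
    def ipmi_like_fence(node: str) -> tuple[bool, float]:
        # Hypothetical stand-in for an IPMI power-off that takes about 4 s.
        return True, 4.0

    result = fail_over("node1", detect_s=1.0, fence=ipmi_like_fence,
                       start_services_s=0.5)
    print(result.elapsed_s)  # 5.5
```

Even with a 1 s detection timeout, the assumed 4 s fence dominates the total.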

Also, if you shorten the time it takes to declare a node dead, you
increase the chance of having a node declared dead when it's not.
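A rough way to see this trade-off is to count how often a healthy node's heartbeat gap exceeds the timeout. The Python sketch below assumes a made-up gap model (a 100 ms send interval plus exponentially distributed extra delay with a 50 ms mean); real stalls are often burstier, so treat the numbers as illustrative only.

```python
import random


def false_dead_declarations(gaps_ms: list[float], timeout_ms: float) -> int:
    """Count heartbeat gaps longer than the timeout.

    Every gap counted here would make a perfectly healthy node look dead.
    """
    return sum(1 for gap in gaps_ms if gap > timeout_ms)


# Hypothetical model: each gap between heartbeats is the 100 ms send
# interval plus a random extra delay (scheduler stalls, network queuing)
# drawn from an exponential distribution with a 50 ms mean.
rng = random.Random(42)
gaps = [100 + rng.expovariate(1 / 50) for _ in range(100_000)]

if __name__ == "__main__":
    for timeout in (150, 300, 600, 1000):
        count = false_dead_declarations(gaps, timeout)
        print(f"timeout {timeout:>4} ms -> {count} false 'dead' declarations")
```

Whatever the delay distribution, the count can only go up as the timeout goes down; the shape of the real distribution decides how fast.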

> The following is the failover time for keepalived. Can Heartbeat do
> better?

Heartbeat is long deprecated. The modern stack is Corosync + Pacemaker.

Here is why:

https://alteeve.ca/w/History_of_HA_Clustering

> No, it can't be shorter than 1 second. This is a VRRP protocol
> limitation. Most enterprise-class VRRP implementations use the BFD
> (Bidirectional Forwarding Detection) protocol to achieve sub-second
> fault detection. Some time ago I created a proof-of-concept BFD
> subsystem for keepalived: https://github.com/ivoronin/keepalived/tree/bfd .
> Unfortunately, it is not well tested and not suitable for production use.
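The quoted timing limit can be made concrete with the published formulas. This sketch assumes VRRPv2 timers as defined in RFC 3768 (advertisement interval in whole seconds, Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time) and BFD asynchronous mode from RFC 5880 (detection time = remote Detect Mult times the agreed interval). The quoted message does not say which VRRP version was in use, and the priority and BFD values below are illustrative choices.

```python
def vrrpv2_master_down_interval_s(advert_interval_s: int, priority: int) -> float:
    """Master_Down_Interval for a VRRPv2 backup router, per RFC 3768.

    Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time
    Skew_Time            = (256 - Priority) / 256
    """
    if advert_interval_s < 1:
        raise ValueError("VRRPv2 advertisement interval is whole seconds, >= 1")
    if not 1 <= priority <= 254:
        raise ValueError("backup priority must be in 1..254")
    return 3 * advert_interval_s + (256 - priority) / 256


def bfd_detection_time_ms(remote_detect_mult: int,
                          local_required_min_rx_ms: float,
                          remote_desired_min_tx_ms: float) -> float:
    """Asynchronous-mode BFD detection time, per RFC 5880.

    The remote Detect Mult multiplied by the agreed interval, which is the
    greater of our Required Min RX and the peer's Desired Min TX.
    """
    return remote_detect_mult * max(local_required_min_rx_ms,
                                    remote_desired_min_tx_ms)


if __name__ == "__main__":
    # Illustrative values: fastest VRRPv2 timer with a typical priority of
    # 100, versus BFD at 50 ms intervals with a multiplier of 3.
    print(vrrpv2_master_down_interval_s(1, 100))  # 3.609375 (seconds)
    print(bfd_detection_time_ms(3, 50, 50))       # 150 (milliseconds)
```

Note that faster detection only shortens the first step; the node still has to be fenced before its services are recovered.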

I've never been a big fan of keepalived because it does not fence. It
assumes that the peer is dead, and when people test it, they kill the
node, so in those test cases the assumption happened to be safe. In the
real world, though, losing contact with a node is no guarantee that it
has actually failed. So people think they're safe, until they're not.
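A toy Python model of that failure mode: the link between two nodes is cut while both are still running. It contrasts an "assume the silent peer is dead" policy (roughly the behaviour described above, not keepalived's actual code) with one that fences first. The node names and the `handle_peer_silence` helper are hypothetical, and the model ignores the real-world case of both nodes trying to fence each other at the same moment.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    alive: bool = True
    running_service: bool = False


def handle_peer_silence(me: Node, peer: Node, use_fencing: bool) -> None:
    """React on `me` when heartbeats from `peer` stop (hypothetical model)."""
    if not me.alive:
        return  # A fenced (powered-off) node can't do anything.
    if use_fencing:
        # Power the peer off before taking over, e.g. via IPMI.
        peer.alive = False
        peer.running_service = False
    # Without fencing, silence is simply assumed to mean "peer is dead".
    me.running_service = True


def simulate_partition(use_fencing: bool) -> list[str]:
    """Cut the link between two healthy nodes; report who runs the service."""
    primary = Node("node1", running_service=True)
    backup = Node("node2")
    # Both nodes stay up; each one just stops hearing the other.
    handle_peer_silence(backup, primary, use_fencing)
    handle_peer_silence(primary, backup, use_fencing)
    return [n.name for n in (primary, backup) if n.alive and n.running_service]


if __name__ == "__main__":
    print("without fencing:", simulate_partition(use_fencing=False))  # ['node1', 'node2']
    print("with fencing:   ", simulate_partition(use_fencing=True))   # ['node2']
```

Killing a node during a test only exercises the case where the assumption is true; a partition like this one is where the two policies diverge.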

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?