[ClusterLabs] I've been working on a split-brain prevention strategy for 2-node clusters.

Sun Oct 9 17:04:49 EDT 2016

On 09/10/16 04:33 PM, Eric Robinson wrote:
> I’ve been working on a script for preventing split-brain in 2-node
> clusters and I would appreciate comments from everyone. If someone
> already has a solution like this, let me know!
>  
> Most of my database clusters are 2-nodes, with each node in a
> geographically separate data center. Our layout looks like the following
> diagram. Each server node has three physical connections to the world.
> LANs A, B, C, D are all physically separate cable plants and
> cross-connects between the data centers (using different switches,
> routers, power, fiber paths, etc.). This is to ensure maximum cluster
> communication intelligence. LANs A and B (Corosync ring 0) are bonded at
> the NICs, as are LANs C and D (Corosync ring 1). 
>  
> Hopefully this diagram will come through intact…
>  
>  
>  
>                      +----------------+
>                      |                |
>                      |   Third party  |
>                      |  Web Hosting   |
>                      +---+--------+---+
>                          |        |
>                          |        |
>                          |        |
>                          |        |
>                          |        |
>                          |        |
>                          ++XX     |
>                         XXX XXXXXX+-+XXX
>                        XX     XX       XXX
>                  XXXXXXX                 XX
>              XXXX     XX                  X
>              X                          XXX
>     +--------+        The Interwebs    XXX+-----+
>     |        XXX                         X      |
>     |         XX                         XX     |
>     |         X                          XX     |
>     |          X    XXXX       XXXXXXXXXXX      |
>     |          XXXXXX  XX     XX                |
>     |                   XXXXXXX                 |
>     |                                           |
>     | Internet                                  | Internet
>     |                                           |
>     |                                           |
>     |                                           |
>     |                 LAN A                     |
>     |   +-----------------------------------+   |
>     |   |             LAN B                 |   |
>     |   |   +---------------------------+   |   |
>     |   |   |                           |   |   |
> +---+---+---+----+                +-----+---+---+--+
> |                |                |                |
> |   Node 1       |                |   Node 2       |
> |                |                |                |
> +------+---+-----+                +-----+---+------+
>        |   |          LAN C             |   |
>        |   +----------------------------+   |
>        |              LAN D                 |
>        +------------------------------------+
>  
>  
>  
> Even with all that connectivity it is possible that something could
> happen to interrupt communication between the 2 data centers, or the
> connectivity between 1 of the data centers and the Internet, and split
> brain would result. I have been working on a way to prevent this using a
> concept I call a “dead drop.” This idea takes its name from the spy
> world, where spies cannot communicate directly, but they are able to
> pass simple information and status messages to each other by using a
> blind drop in a previously agreed location. Spy X makes a mark on a
> tree. Later, spy Y comes by and sees the mark, and knows that spy X is
> okay. He leaves a mark of his own on the tree, and later spy X sees it
> and knows that spy Y is okay. Neither spy owns the tree or the land it
> is on.
>  
> The same idea applies here. Suppose all direct TCP/IP connectivity were
> to be severed between Nodes 1 and 2, but both of them are still able to
> reach the Internet. Normally, split brain would result. But SUPPOSE they
> were both running scripts that use curl requests to post and retrieve
> simple status messages to and from a third party web host. In other
> words, even though the nodes cannot talk to each other directly, they
> can still leave messages at a dead drop location for each other to read.
> If Node 2 was in standby mode, normally it would switch to primary.
> However, if it checks the dead drop and sees a message from Node 1 that
> says, “I’m still okay and communicating with customers.” Then Node 2
> knows not to become cluster primary. This script could possibly be
> implemented as a cluster resource, with most other resources dependent
> on it.
>  
> The dead drop needs no intelligence other than the ability to read and
> write simple text files, and it can run on any third-party web host (or
> on multiple web sites). It does not fill the role of a quorum or
> arbitrator. The 2 Nodes themselves remain in control of their own
> failover decisions.
>  
> I’m SURE this has been attempted already and I don’t want to re-invent
> the wheel, but I have not seen this approach anywhere. Maybe there’s a
> good reason for that because it simply won’t work? The arbitration
> solutions I have seen all rely on a third machine that plays a complex
> role in arbitration.
>  
> Thoughts?

The fundamental problem with this approach is that it is predicated on
the idea that both nodes are behaving in a predictable way. That is a
fatal flaw.
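
To make the flaw concrete, here is a minimal Python sketch. The peer
states and function names are hypothetical, chosen only for
illustration; they are not part of the proposed script. It shows that a
stale or missing message at the dead drop looks the same to the
surviving node whether the peer is dead, hung, or alive and serving but
unable to reach the drop.

```python
from enum import Enum


class PeerState(Enum):
    """Hypothetical real-world states of the peer (illustration only)."""
    HEALTHY = "alive and posting to the dead drop"
    DEAD = "powered off or destroyed"
    HUNG = "frozen, may resume and write to storage later"
    CUT_OFF = "alive and serving, but unable to reach the dead drop"


def drop_is_stale(state: PeerState) -> bool:
    """What the other node observes: only a healthy peer keeps the drop fresh."""
    return state is not PeerState.HEALTHY


def states_matching_stale_drop() -> list[PeerState]:
    """Every real peer state that is consistent with seeing a stale drop."""
    return [s for s in PeerState if drop_is_stale(s)]


if __name__ == "__main__":
    for s in states_matching_stale_drop():
        print(f"stale drop could mean: {s.value}")
```

Only one of those states is safe to take over from, and the node
reading the drop has no way to tell which one it is looking at.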

Now, if you truly have redundant interconnects up to the nodes
(different conduit, etc), and if you implement proper fencing, then you
don't need the avoidable complexity of this "dead-drop" approach.

If a node fails, the network is still up, so you can fence the remote
node. You can consider this a local cluster insofar as you trust that
you have multiple paths to the peer.

If, however, the building is lost (so no fencing works, no comms to the
peer), then the cluster will hang because a proper HA system can't make
assumptions.

Years ago, I tried to solve a problem very similar to this one. The
difference was that, in my case, the nodes had single PSUs and I was
trying to buffer against a failed switched PDU, which would take out the
node and both fence methods, leaving the cluster hung. I explored
solving this problem in a fairly similar way, except I looked at the link
state of a third cable connecting directly to the peer (power == link).
So I could then say "if I can't talk to the node and if I can't talk to
IPMI and if I can't talk to the PDU *and* the link light was lost,
assume the peer is dead".

https://github.com/digimer/fence_passive_nic
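
As a rough sketch, that decision rule could be written in Python as
follows. This is not taken from fence_passive_nic itself; the class,
the function names, and the interface name "eth2" are assumptions made
for illustration. Reading the carrier file under /sys/class/net assumes
a Linux host.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class PeerChecks:
    node_reachable: bool   # cluster communication with the peer works
    ipmi_reachable: bool   # the peer's IPMI interface answers
    pdu_reachable: bool    # the switched PDU answers
    link_up: bool          # carrier present on the direct cable to the peer


def read_link_state(iface: str, sysfs_root: str = "/sys/class/net") -> Optional[bool]:
    """Return True/False for carrier up/down, or None if it cannot be read.

    The Linux kernel exposes carrier as "1" or "0" in sysfs.
    """
    try:
        text = (Path(sysfs_root) / iface / "carrier").read_text().strip()
    except OSError:
        return None
    return text == "1"


def assume_peer_dead(checks: PeerChecks) -> bool:
    """True only when every check fails, including the link light."""
    return not (
        checks.node_reachable
        or checks.ipmi_reachable
        or checks.pdu_reachable
        or checks.link_up
    )


if __name__ == "__main__":
    link = read_link_state("eth2")  # placeholder interface name
    # An unreadable link state (None) is treated as "up", the cautious choice.
    checks = PeerChecks(False, False, False, link is not False)
    print("assume peer dead:", assume_peer_dead(checks))
```

Even with all four checks failing, "dead" here is still an inference
about the peer, not a confirmation.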

I eventually abandoned this idea though, because you just can't safely
make assumptions about a peer's state.

The only geo-located/stretch cluster approach that I've seen that makes
any sense and seems genuinely safe is SUSE's 'pacemaker booth' project.
In that case, you basically have a cluster of clusters. Each physical
location is a self-contained cluster, and the booth cluster uses an
arbiter location. In this way, it is also similar to your dead-drop
idea, except that you have an arbiter node out in the world somewhere.

The logic behind Booth is this:

If I lose contact with my peer, I can safely proceed because the peer
site will be in one of two states: 1. It is destroyed. 2. It is alive,
but lost access and can be trusted to shut itself down.

The key here is the second part: because each location is itself a
cluster, we can *safely* assume that it will behave properly, if it is
alive at all.
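
The reasoning above can be illustrated with a toy majority model in
Python. This is not Booth's actual protocol or API, and it ignores
details such as ticket expiry; the member names and function are made up
for the example. The assumption is two sites plus one arbitrator, where
a site may run resources only while it can see a strict majority of the
three members, counting itself.

```python
MEMBERS = frozenset({"site_a", "site_b", "arbitrator"})  # hypothetical names


def site_action(site: str, reachable_peers: set) -> str:
    """Decide what a site does, given which other members it can reach."""
    others = MEMBERS - {site}
    votes = 1 + len(set(reachable_peers) & others)  # the site counts itself
    if votes > len(MEMBERS) // 2:
        return "may hold ticket"
    return "give up ticket and stop resources"


if __name__ == "__main__":
    # site_a is cut off; site_b still reaches the arbitrator.
    print("site_a:", site_action("site_a", set()))
    print("site_b:", site_action("site_b", {"arbitrator"}))
```

In any partition, at most one side can reach a majority, so an isolated
site that is still alive stops on its own while the other side proceeds.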

So, long answer short: use Booth if you really want geo-located HA. I
wasted my time, don't waste yours.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?