[ClusterLabs] I've been working on a split-brain prevention strategy for 2-node clusters.
eric.robinson at psmnv.com
Sun Oct 9 17:42:50 EDT 2016
Digimer, thanks for your thoughts. Booth is one of the solutions I looked at, but I don't like it because it is complex and difficult to implement (and perhaps costly in terms of AWS services or something similar). As I read through your comments, I returned again and again to the feeling that the troubles you described do not apply to the DeadDrop scenario. Your observations are correct in that you cannot make assumptions about the state of the other node when all coms are down. You cannot count on the other node being in a predictable state. That is certainly true, and it is the very problem that I hope to address with DeadDrop. It provides a last-resort "back channel" for coms between the cluster nodes when all other coms are down, removing the element of assumption.
Consider a few scenarios.
1. Data center A is primary, B is secondary. Coms are lost between A and B, but both of them can still reach the Internet. Node A notices loss of coms with B, but it is already primary so it cares not. Node B sees loss of normal cluster communication, and it might normally think of switching to primary, but first it checks the DeadDrop and it sees a note from A saying, "I'm fine and serving pages for customers." B aborts its plan to become primary. Later, after normal links are restored, B rejoins the cluster still as secondary. There is no element of assumption here.
2. Data center A is primary, B is secondary. A loses communication with the Internet, but not with B. B can still talk to the Internet. B initiates a graceful failover. Again no assumptions.
3. Data center A is primary, B is secondary. Data center A goes completely dark. No communication to anything, not to B, and not to the outside world. B wants to go primary, but first it checks DeadDrop, and it finds that A is not leaving messages there either. It therefore KNOWS that A cannot reach the Internet and is not reachable by customers. No assumptions there. B assumes primary role and customers are happy. When A comes back online, it detects split-brain and refuses to join the cluster, notifying operators. Later, operators manually resolve the split brain.
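The three scenarios above can be sketched as a small decision function for the secondary node (B). This is only an illustration of the reasoning in this message, not an existing Pacemaker or Corosync API: the names `Action` and `secondary_action` and the three boolean inputs are hypothetical, and the sketch assumes something else has already worked out whether the cluster link is up, whether A has reported losing its Internet link, and whether A's note in the dead drop is recent.

```python
from enum import Enum


class Action(Enum):
    STAY_SECONDARY = "stay secondary"
    GRACEFUL_FAILOVER = "graceful failover"
    PROMOTE = "promote to primary"


def secondary_action(cluster_link_up: bool,
                     peer_reports_internet_loss: bool,
                     peer_note_is_fresh: bool) -> Action:
    """Decide what the secondary node (B) does (hypothetical helper)."""
    if cluster_link_up:
        # Scenario 2: A and B can still talk, but A has lost the Internet.
        if peer_reports_internet_loss:
            return Action.GRACEFUL_FAILOVER
        return Action.STAY_SECONDARY
    # Normal cluster communication is down: check the dead drop first.
    if peer_note_is_fresh:
        # Scenario 1: A is still leaving notes, so B stays put.
        return Action.STAY_SECONDARY
    # Scenario 3: no recent note from A, so B takes over.
    return Action.PROMOTE
```

Note that the last branch encodes the conclusion drawn in scenario 3 (no note means A is unreachable by customers); whether that inference is safe is exactly the point under discussion in this thread.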
There is no perfect solution, of course, but it seems to me that this simple approach provides a level of availability beyond what you would normally get with a 2-node cluster. What am I missing?
From: Digimer [mailto:lists at alteeve.ca]
Sent: Sunday, October 09, 2016 2:05 PM
To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
Subject: Re: [ClusterLabs] I've been working on a split-brain prevention strategy for 2-node clusters.
On 09/10/16 04:33 PM, Eric Robinson wrote:
> I've been working on a script for preventing split-brain in 2-node
> clusters and I would appreciate comments from everyone. If someone
> already has a solution like this, let me know!
> Most of my database clusters are 2-nodes, with each node in a
> geographically separate data center. Our layout looks like the
> following diagram. Each server node has three physical connections to the world.
> LANs A, B, C, and D are all physically separate cable plants and
> cross-connects between the data centers (using different switches,
> routers, power, fiber paths, etc.). This is to ensure maximum cluster
> communication intelligence. LANs A and B (Corosync ring 0) are bonded
> at the NICs, as are LANs C and D (Corosync ring 1).
> Hopefully this diagram will come through intact...
>                 +---------------+
>                 |  Third party  |
>                 |  Web Hosting  |
>                 +-------+-------+
>                         |
>     +-----------( The Interwebs )-----------+
>     |                                       |
>     | Internet                              | Internet
>     |                                       |
>     |                 LAN A                 |
>     |   +-------------------------------+   |
>     |   |             LAN B             |   |
>     |   |   +-----------------------+   |   |
>     |   |   |                       |   |   |
> +---+---+---+----+            +-----+---+---+--+
> |                |            |                |
> |     Node 1     |            |     Node 2     |
> |                |            |                |
> +------+---+-----+            +-----+---+------+
>        |   |         LAN C          |   |
>        |   +------------------------+   |
>        |             LAN D              |
>        +--------------------------------+
> Even with all that connectivity it is possible that something could
> happen to interrupt communication between the 2 data centers, or the
> connectivity between one of the data centers and the Internet, and split
> brain would result. I have been working on a way to prevent this using
> a concept I call a "dead drop." This idea takes its name from the spy
> world, where spies cannot communicate directly, but they are able to
> pass simple information and status messages to each other by using a
> blind drop in a previously agreed location. Spy X makes a mark on a
> tree. Later, spy Y comes by and sees the mark, and knows that spy X is
> okay. He leaves a mark of his own on the tree, and later spy X sees it
> and knows that spy Y is okay. Neither spy owns the tree or the land it
> is on.
> The same idea applies here. Suppose all direct TCP/IP connectivity
> were to be severed between Nodes 1 and 2, but both of them are still
> able to reach the Internet. Normally, split brain would result. But
> SUPPOSE they were both running scripts that use curl requests to post
> and retrieve simple status messages to and from a third party web
> host. In other words, even though the nodes cannot talk to each other
> directly, they can still leave messages at a dead drop location for each other to read.
> If Node 2 was in standby mode, normally it would switch to primary.
> However, if it checks the dead drop and sees a message from Node 1
> that says, "I'm still okay and communicating with customers," then
> Node 2 knows not to become cluster primary. This script could possibly
> be implemented as a cluster resource, with most other resources
> dependent on it.
> The dead drop needs no intelligence other than the ability to read and
> write simple text files, and it can run on any third-party web host
> (or on multiple web sites). It does not fill the role of a quorum or
> arbitrator. The 2 Nodes themselves remain in control of their own
> failover decisions.
> I'm SURE this has been attempted already and I don't want to re-invent
> the wheel, but I have not seen this approach anywhere. Maybe there's a
> good reason for that because it simply won't work? The arbitration
> solutions I have seen all rely on a third machine that plays a complex
> role in arbitration.
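The note exchange described in the quoted message can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual script: the dead drop is modeled as an in-memory dictionary standing in for the third-party web host (the real thing would use curl or HTTP requests), and the names `DeadDrop`, `leave_note`, `read_fresh_note`, and the `max_age_s` parameter are all hypothetical.

```python
import time
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for the third-party web host:
# node name -> (time the note was written, message text).
DeadDrop = Dict[str, Tuple[float, str]]


def leave_note(drop: DeadDrop, node: str, message: str,
               now: Optional[float] = None) -> None:
    """Overwrite this node's note with a timestamped status message."""
    written_at = time.time() if now is None else now
    drop[node] = (written_at, message)


def read_fresh_note(drop: DeadDrop, peer: str, max_age_s: float,
                    now: Optional[float] = None) -> Optional[str]:
    """Return the peer's message if it is recent enough, else None."""
    current = time.time() if now is None else now
    note = drop.get(peer)
    if note is None:
        return None  # the peer has never left a note
    written_at, message = note
    if current - written_at > max_age_s:
        return None  # the note is stale, so it says nothing about now
    return message
```

A note is only meaningful if it is recent, so the sketch treats a stale note the same as a missing one. It also assumes the two nodes' clocks roughly agree; timestamps assigned by the web host itself would avoid that assumption.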
The fundamental problem with this approach is that it is predicated on the idea that both nodes are behaving in a predictable way. That is a fatal flaw.
Now, if you truly have redundant interconnects up to the nodes (different conduit, etc), and if you implement proper fencing, then you don't need the avoidable complexity of this "dead-drop" approach.
If a node fails, then the network is up and you can fence the remote node. You can consider this a local cluster in so far as you trust that you have multiple paths to the peer.
If, however, the building is lost (so no fencing works, no comms to the peer), then the cluster will hang because a proper HA system can't make assumptions.
Years ago, I tried to solve a problem very similar to this one. The difference was that, in my case, the nodes had single PSUs and I was trying to buffer against a failed switched PDU, which would take out the node and both fence methods, leaving the cluster hung. I explored solving this problem in a fairly similar way, except I looked at the link state of a third cable connecting directly to the peer (power == link).
So I could then say "if I can't talk to the node and if I can't talk to IPMI and if I can't talk to the PDU *and* the link light was lost, assume the peer is dead".
I eventually abandoned this idea, though, because you just can't safely make assumptions about a peer's state.
The only geo-located/stretch cluster approach that I've seen that makes any sense and seems genuinely safe is SUSE's 'pacemaker booth' project.
In that case, you basically have a cluster of clusters. Each physical location is a self-contained cluster, and the booth cluster uses an arbiter location. In this way, it is also similar to your dead-drop idea, save for that you have an arbiter node out in the world somewhere.
The logic behind Booth is this:
If I lose contact with my peer, I can safely proceed because the peer site will be in one of two states: 1. It is destroyed. 2. It is alive, but lost access and can be trusted to shut itself down.
The key here is the second part: because each location is itself a cluster, we can *safely* assume that it will behave properly, if it is alive at all.
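The reasoning behind "can be trusted to shut itself down" can be illustrated with a simple majority rule. This is a deliberately simplified sketch, not Booth's actual protocol or API: the function name `may_hold_ticket` is hypothetical, and it assumes a setup of two sites plus one arbiter, where a site may keep running services only while it can reach more than half of the members (counting itself).

```python
def may_hold_ticket(reachable_members: int, total_members: int) -> bool:
    """True if this site sees a strict majority of members.

    reachable_members counts this site itself (hypothetical helper).
    """
    return reachable_members > total_members // 2


# Two sites plus one arbiter: three members in total.
TOTAL = 3

# A site that still reaches the arbiter (itself + arbiter = 2) may proceed.
site_with_arbiter = may_hold_ticket(2, TOTAL)

# A site cut off from everything (itself only = 1) must stand down.
isolated_site = may_hold_ticket(1, TOTAL)
```

Under this rule, at most one side of any partition can hold a majority, which is why the surviving site does not have to guess what the isolated one is doing.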
So, long answer short: use Booth if you really want geo-located HA. I wasted my time, don't waste yours.
Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education?