[ClusterLabs] Re: I've been working on a split-brain prevention strategy for 2-node clusters.
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Mon Oct 10 06:29:57 UTC 2016
>>> Eric Robinson <eric.robinson at psmnv.com> wrote on 09.10.2016 at 22:33 in message
<DM5PR03MB2729630C1A1453E4EB61ED8AFAD80 at DM5PR03MB2729.namprd03.prod.outlook.com>:
> I've been working on a script for preventing split-brain in 2-node clusters and
> I would appreciate comments from everyone. If someone already has a solution
> like this, let me know!
Hi!
I'd try to avoid working on the wrong problem: if you have FC (Fibre Channel) in addition to LAN (assuming you don't run FC over IP on those LANs), I'd strongly suggest using SBD and fencing, or a third "witness node" for quorum. It's not obvious to me what problem you are really trying to solve.
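A minimal sketch of the strict-majority rule behind vote-based quorum shows why a third vote helps. When the interconnect of a 2-node cluster is cut, each side sees 1 of 2 votes and neither has a majority, which is why 2-node setups typically relax the rule and rely on fencing instead. With a witness, the side that still reaches it sees 2 of 3 votes. The function name and vote counts are illustrative only, not any cluster stack's API.

```python
def has_quorum(votes_seen: int, total_votes: int) -> bool:
    """Strict-majority rule: a partition may run resources only if it can see
    more than half of all configured votes."""
    return votes_seen > total_votes // 2


if __name__ == "__main__":
    # Two nodes, interconnect cut: each side sees only its own vote.
    print(has_quorum(1, 2))  # False on both sides: a tie, nobody wins
    # Add a witness as a third vote: the side that still reaches it wins.
    print(has_quorum(2, 3))  # True
    print(has_quorum(1, 3))  # False
```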
>
> Most of my database clusters are 2-nodes, with each node in a geographically
> separate data center. Our layout looks like the following diagram. Each
> server node has three physical connections to the world. LANs A, B, C, and D are
> all physically separate cable plants and cross-connects between the data
> centers (using different switches, routers, power, fiber paths, etc.). This
> is to ensure maximum cluster communication intelligence. LANs A and B
> (Corosync ring 0) are bonded at the NICs, as are LANs C and D (Corosync ring
> 1).
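A toy model of the layout above makes the redundancy argument concrete. It assumes that each bond keeps its Corosync ring usable as long as at least one member LAN survives (as with active-backup bonding) and that the nodes stay in contact while either ring is up. The LAN and ring assignments come from the description; the functions are hypothetical. Under these assumptions, direct contact is lost only when all four LANs fail at once.

```python
from typing import Dict

# Hypothetical model of the layout described above: LANs A and B are bonded
# into Corosync ring 0, LANs C and D into ring 1.
RINGS = {0: ("A", "B"), 1: ("C", "D")}


def ring_up(ring: int, lan_up: Dict[str, bool]) -> bool:
    # Assumption: a bond keeps its ring usable while any member LAN is up.
    return any(lan_up[lan] for lan in RINGS[ring])


def nodes_can_talk(lan_up: Dict[str, bool]) -> bool:
    # Assumption: with redundant rings, the nodes stay in contact while
    # at least one ring is up.
    return any(ring_up(ring, lan_up) for ring in RINGS)


if __name__ == "__main__":
    all_down = {"A": False, "B": False, "C": False, "D": False}
    print(nodes_can_talk(all_down))                 # False
    print(nodes_can_talk({**all_down, "D": True}))  # True
```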
>
> Hopefully this diagram will come through intact...
>
>
>
>                   +----------------+
>                   |  Third party   |
>                   |  Web Hosting   |
>                   +--------+-------+
>                            |
>                   ( The Interwebs )
>                            |
>                 +---------+---------+
>                 |                   |
>        Internet |                   | Internet
>                 |                   |
>     +-----------+----+         +----+-----------+
>     |                |==LAN A==|                |
>     |                |==LAN B==|                |
>     |     Node 1     |         |     Node 2     |
>     |                |==LAN C==|                |
>     |                |==LAN D==|                |
>     +----------------+         +----------------+
>
>
>
> Even with all that connectivity it is possible that something could happen
> to interrupt communication between the 2 data centers, or the connectivity
> between 1 of the data centers and the Internet, and split brain would result. I
> have been working on a way to prevent this using a concept I call a "dead
> drop." This idea takes its name from the spy world, where spies cannot
> communicate directly, but they are able to pass simple information and status
> messages to each other by using a blind drop in a previously agreed location.
> Spy X makes a mark on a tree. Later, spy Y comes by and sees the mark, and
> knows that spy X is okay. He leaves a mark of his own on the tree, and later
> spy X sees it and knows that spy Y is okay. Neither spy owns the tree or the
> land it is on.
>
> The same idea applies here. Suppose all direct TCP/IP connectivity were to
> be severed between Nodes 1 and 2, but both of them are still able to reach
> the Internet. Normally, split brain would result. But SUPPOSE they were both
> running scripts that use curl requests to post and retrieve simple status
> messages to and from a third party web host. In other words, even though the
> nodes cannot talk to each other directly, they can still leave messages at a
> dead drop location for each other to read. If Node 2 were in standby mode,
> it would normally switch to primary. However, if it checks the dead drop and
> sees a message from Node 1 saying, "I'm still okay and communicating with
> customers," then Node 2 knows not to become cluster primary. This script
> could possibly be implemented as a cluster resource, with most other
> resources dependent on it.
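The decision Node 2 makes at the dead drop can be sketched as a small pure function. Two rules here are assumptions that the proposal does not state: a note older than a timeout (a placeholder 30 seconds) is treated as absent, since otherwise a stale "I'm okay" from a dead peer would block failover forever; and a node that cannot reach the drop at all does not promote itself, on the reasoning that it probably cannot reach customers either. `DropMessage`, `should_promote`, and `STALE_AFTER` are hypothetical names. Comparing timestamps also assumes the two nodes' clocks are reasonably synchronized.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: how long (in seconds) a peer's "I'm okay" note stays trustworthy.
# The proposal does not say; 30 s is a placeholder.
STALE_AFTER = 30.0


@dataclass
class DropMessage:
    node: str         # which node left the note, e.g. "node1"
    serving: bool     # True if that node says it is still serving customers
    timestamp: float  # when the note was written (Unix time, seconds)


def should_promote(drop_reachable: bool,
                   peer_msg: Optional[DropMessage],
                   now: float,
                   stale_after: float = STALE_AFTER) -> bool:
    """Decide whether a standby node that has lost direct contact with its
    peer should become primary, using only what it finds at the dead drop."""
    if not drop_reachable:
        # Assumption: if this node cannot reach the drop, it probably cannot
        # reach customers either, so taking over would not help.
        return False
    if peer_msg is None:
        # No note from the peer at all: nothing says it is alive.
        return True
    if now - peer_msg.timestamp > stale_after:
        # The note is too old; the peer may have died after writing it.
        return True
    # A fresh note: stay standby if the peer says it is still serving.
    return not peer_msg.serving


if __name__ == "__main__":
    now = 1000.0
    print(should_promote(True, DropMessage("node1", True, now - 5), now))   # False
    print(should_promote(True, DropMessage("node1", True, now - 60), now))  # True
    print(should_promote(True, None, now))                                  # True
```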
>
> The dead drop needs no intelligence other than the ability to read and write
> simple text files, and it can run on any third-party web host (or on multiple
> web sites). It does not fill the role of a quorum or arbitrator. The 2 Nodes
> themselves remain in control of their own failover decisions.
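Here is one way the "simple text files" could look, assuming a hypothetical one-line format (node name, state, Unix timestamp) and several drop sites that are read in turn. The helper names and file layout are illustrative. Fetching each file (e.g. with curl, as the proposal suggests) is left out, and `None` stands in for a site that could not be reached. Picking the freshest valid note means one stale or unreachable site cannot hide a newer message posted at another.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical parsed note: (node name, serving?, Unix timestamp).
Note = Tuple[str, bool, float]


def format_note(node: str, serving: bool, timestamp: float) -> str:
    """Render a note as one line, e.g. 'node1 serving 1476045000.0'."""
    state = "serving" if serving else "standby"
    return f"{node} {state} {timestamp:.1f}\n"


def parse_note(text: str) -> Optional[Note]:
    """Parse one note; return None if the file is empty or malformed."""
    parts = text.split()
    if len(parts) != 3 or parts[1] not in ("serving", "standby"):
        return None
    try:
        timestamp = float(parts[2])
    except ValueError:
        return None
    return parts[0], parts[1] == "serving", timestamp


def freshest_note(texts: Iterable[Optional[str]], peer: str) -> Optional[Note]:
    """Return the newest valid note written by `peer` across all drop sites.
    Each item is a site's file contents, or None if the site was unreachable."""
    best: Optional[Note] = None
    for text in texts:
        if text is None:
            continue
        note = parse_note(text)
        if note is None or note[0] != peer:
            continue
        if best is None or note[2] > best[2]:
            best = note
    return best


if __name__ == "__main__":
    sites = [format_note("node1", True, 100.0), None,
             format_note("node1", True, 160.0), "garbage"]
    print(freshest_note(sites, "node1"))  # ('node1', True, 160.0)
```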
>
> I'm SURE this has been attempted already and I don't want to re-invent the
> wheel, but I have not seen this approach anywhere. Maybe there's a good
> reason for that: it simply won't work? The arbitration solutions I
> have seen all rely on a third machine that plays a complex role in
> arbitration.
>
> Thoughts?
>
> --
> Eric Robinson
More information about the Users mailing list