[ClusterLabs] I've been working on a split-brain prevention strategy for 2-node clusters.

Mon Oct 10 10:25:39 EDT 2016

On 10/09/2016 11:02 PM, Digimer wrote:
> On 09/10/16 11:58 PM, Andrei Borzenkov wrote:
>> 10.10.2016 00:42, Eric Robinson wrote:
>>> Digimer, thanks for your thoughts. Booth is one of the solutions I
>>> looked at, but I don't like it because it is complex and difficult to
>>> implement
>>
>> HA is complex. There is no way around it.
>>
>>> (and perhaps costly in terms of AWS services or something
>>> similar). As I read through your comments, I returned again and
>>> again to the feeling that the troubles you described do not apply to
>>> the deaddrop scenario. Your observations are correct in that you
>>> cannot make assumptions about the state of the other node when all
>>> coms are down.  You cannot count on the other node being in a
>>> predictable state. That is certainly true, and it is the very problem
>>> that I hope to address with DeadDrop. It provides a last-resort "back
>>> channel" for coms between the cluster nodes when all other coms are
>>> down, removing the element of assumption.

The "dead drop" approach you describe is essentially what any
third-party arbitrator does (e.g. a quorum-only node, a booth
arbitrator, or the new qdevice functionality of corosync). Arbitrators
do require a full host rather than just a website, but the added
intelligence is significant.

>>>
>>> Consider a few scenarios.
>>>
>>> 1. Data center A is primary, B is secondary. Coms are lost between A
>>> and B, but both of them can still reach the Internet. Node A notices
>>> loss of coms with B, but it is already primary so it cares not. Node
>>> B sees loss of normal cluster communication, and it might normally
>>> think of switching to primary, but first it checks the DeadDrop and
>>> it sees a note from A saying, "I'm fine and serving pages for
>>> customers." B aborts its plan to become primary. Later, after normal
>>> links are restored, B rejoins the cluster still as secondary. There
>>> is no element of assumption here.
>>>
>>> 2.  Data center A is primary, B is secondary. A loses communication
>>> with the Internet, but not with B. B can still talk to the Internet.
>>> B initiates a graceful failover. Again no assumptions.
>>>
>>> 3. Data center A is primary, B is secondary. Data center A goes
>>> completely dark. No communication to anything, not to B, and not to
>>> the outside world. B wants to go primary, but first it checks
>>> DeadDrop, and it finds that A is not leaving messages there either.
>>> It therefore KNOWS that A cannot reach the Internet and is not
>>> reachable by customers.

This situation is indistinguishable from the scenario where A is mostly
dark (can't communicate with B or the drop site), but still capable of
accessing shared resources. A and B will both go primary (split-brain).
This should be a small risk in your setup, but it is worth considering
what possibilities could lead to it and what negative effects it could have.
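
To make the ambiguity concrete, here is a minimal sketch of the
secondary node's decision logic as I understand the three scenarios.
The function and parameter names are hypothetical, and the branch for
"B cannot reach the drop site itself" is my own assumption, since the
proposal does not cover it.

```python
from enum import Enum


class Action(Enum):
    STAY_SECONDARY = "stay secondary"
    GRACEFUL_FAILOVER = "graceful failover"
    PROMOTE = "promote to primary"


def secondary_decision(peer_link_up, peer_has_internet,
                       drop_reachable, peer_note_fresh):
    """Decide what secondary node B does, using only what B can observe."""
    if peer_link_up:
        # Scenario 2: A is reachable from B but cut off from customers.
        if peer_has_internet:
            return Action.STAY_SECONDARY
        return Action.GRACEFUL_FAILOVER
    if not drop_reachable:
        # Assumption: B itself is isolated, so it learns nothing about A.
        return Action.STAY_SECONDARY
    if peer_note_fresh:
        # Scenario 1: A says it is still serving customers.
        return Action.STAY_SECONDARY
    # Scenario 3: no recent note from A, so B takes over.
    return Action.PROMOTE


def note_visible_to_b(a_running, a_reaches_drop):
    """True if B would find a fresh note from A at the drop site."""
    return a_running and a_reaches_drop


# Two different true states of A look identical from B's side.
a_dark = note_visible_to_b(a_running=False, a_reaches_drop=False)
a_isolated_but_alive = note_visible_to_b(a_running=True, a_reaches_drop=False)
print(a_dark == a_isolated_but_alive)
print(secondary_decision(False, False, True, a_isolated_but_alive))
```

B promotes in both cases, but only in the first is A actually stopped.
In the second, A may still be writing to shared resources.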

>> Depending on your application it still may have active consumers or
>> providers on site A so data on site A and site B can diverge. You need
>> some steps to ensure that site A is really dead. I.e. site A in this
>> case probably needs to commit suicide. This returns us to the same
>> question - to what extent we can trust the other side. In practice there
>> are quite a few HA solutions that rely on suicide in case of
>> communication loss, so it appears to work in real life.
> 
> Not really. It's just that the number of times it fails is small enough
> that people don't hit it often. Doesn't mean the danger isn't there.
> 
> Consider this: a node stops responding, its peer waits, then assumes it's
> dead (failed or suicided) and takes over. Meanwhile, the node is hung, not
> dead. It finally recovers and, being a machine, doesn't realize time has
> passed (at least not for a short bit). It has no reason to check its locks or

This shouldn't be a problem with a hardware watchdog.

The only risk in a hardware watchdog setup is that the watchdog itself
has failed at the hardware level, while at the same time, the node is
doing something bad but is still capable of messing with shared resources.

Take a look at sbd for watchdog integration with pacemaker.
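
As a rough illustration of why the watchdog closes the "hung, then
resumes" window, here is a toy model. It is not how sbd is implemented;
the class, the function names, and the safety margin are all
hypothetical, and it assumes the node last petted the watchdog at the
moment it hung.

```python
class SimulatedWatchdog:
    """Toy stand-in for a hardware watchdog timer."""

    def __init__(self, timeout_s, now=0.0):
        self.timeout_s = timeout_s
        self.last_pet = now

    def pet(self, now):
        # A healthy node resets the countdown periodically.
        self.last_pet = now

    def has_fired(self, now):
        # The hardware resets the node once the countdown runs out.
        return now - self.last_pet >= self.timeout_s


def node_resumes_old_work(hang_start, hang_length, watchdog):
    """True if a hung node wakes up and carries on; False if it was reset."""
    wake_time = hang_start + hang_length
    return not watchdog.has_fired(wake_time)


def peer_may_take_over(silence_s, watchdog_timeout_s, margin_s):
    """The peer waits longer than the watchdog timeout before taking over."""
    return silence_s >= watchdog_timeout_s + margin_s


wd = SimulatedWatchdog(timeout_s=5.0)
wd.pet(now=10.0)  # last sign of life, then the node hangs
print(node_resumes_old_work(10.0, 2.0, wd))   # short stall
print(node_resumes_old_work(10.0, 30.0, wd))  # long hang
print(peer_may_take_over(8.0, 5.0, 2.0))
```

A short stall never gives the peer reason to take over, and a long hang
ends in a reset before the node can resume its old work. The peer only
has to wait out the watchdog timeout plus some margin.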

> other states, and proceeds as it was before it hung. Depending on what
> it was doing, this could be very bad. Had this been a booth setup, the
> hung node would have been fenced, and the remote side can actually trust
> that this would happen so wouldn't need direct confirmation.
> 
> There are other scenarios, this is just the first one to come to mind.
> 
>>> No assumptions there. B assumes primary role
>>> and customers are happy. When A comes back online, it detects
>>> split-brain and refuses to join the cluster, notifying operators.
>>> Later, operators manually resolve the split brain.
>>>
>>> There is no perfect solution, of course, but it seems to me that this
>>> simple approach provides a level of availability beyond what you
>>> would normally get with a 2-node cluster. What am I missing?
>>>
>>
>> Note that a tie-breaker solution answers a single question - is it safe
>> to take over from the other node. But there is much more flowing over the
>> cluster interconnect, so your cluster is basically frozen - no state change
>> may be allowed. This means you cannot do anything on either site, and it is
>> absolutely unclear how the HA monitor should behave when it needs to
>> initiate a state change, e.g. in response to external events.
>>
>> Unless you again trust the other side to stop all services (i.e. go to a
>> known state).