[ClusterLabs] Recovering after split-brain

Tue Jun 21 04:33:11 UTC 2016

Let me give the full picture of our solution; that should make the
discussion easier.

We are looking at providing N + 1 Redundancy to our application servers,
i.e. 1 standby for up to N active (currently N<=5). Each server will have
some unique configuration. The standby will store the configuration of all
the active servers such that whichever server goes down, the standby loads
that particular configuration and becomes active. The server that went down
will now become standby.
We have bundled all the configuration that every server has into a resource
such that during failover the resource is moved to the newly active server,
and that way it takes up the personality of the server that went down. To
put it differently, every active server has a 'unique' resource that is
started by Pacemaker whereas standby has none.
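
As a rough illustration of the hand-off described above, here is a toy
Python sketch. It is not Pacemaker code, and the node and resource names
used with it are made up; it only models which node holds which
'personality' resource before and after a failover.

```python
def fail_over(placement, standby, failed_node):
    """Model the N + 1 hand-off: the standby takes over the failed node's resource.

    placement maps each active node to its unique resource (hypothetical names).
    Returns the new placement and the node that is now the standby.
    """
    if failed_node not in placement:
        raise ValueError(f"{failed_node} is not an active node")
    new_placement = dict(placement)            # leave the caller's dict untouched
    resource = new_placement.pop(failed_node)  # the personality that must move
    new_placement[standby] = resource          # standby takes up that personality
    return new_placement, failed_node          # the failed node becomes the standby
```

For example, with node1 and node2 active and node3 as standby, a failure
of node1 leaves node3 holding node1's resource and node1 as the new standby.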

Our servers do not write anything to an external database; all the writing
is done to the CIB file, under the resource that each server is currently
managing.
We also have some clients that connect to the active servers (1 client can
connect to only 1 server, 1 server can have multiple clients) and provide
service to end-users. Now the reason I say that split-brain is not an issue
for us is that the clients can only connect to 1 of the active servers at
any given time (we have to handle the case that all clients move together
and do not get distributed). So even if two servers become active with the
same personality, the clients can only connect to 1 of them. (The initial
plan was to configure quorum, but later I was told that service
availability is of utmost importance, and since the impact of split-brain
is limited, we are thinking of doing away with it.)

Now the concern I have is, once the split is resolved, I would have 2
actives, each having its own view of the resource, trying to synchronize
the CIB. At this point I want the one that has the clients attached to it
to win.
I am thinking I can implement a monitor function that can bring down the
resource if it doesn't find any clients attached to it within a given
period of time. But to understand the Pacemaker behavior, what exactly
would happen if the same resource is found to be active on two nodes after
recovery?
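
To make the monitor idea concrete, here is a minimal Python sketch of the
timing logic only. The class name, the grace period and the way the client
count is obtained are all assumptions for illustration; a real resource
agent would have to count the attached clients in an application-specific
way and then actually stop the resource when this check says so.

```python
import time


class ClientWatchdog:
    """Decide whether a resource should be given up for lack of clients.

    Hypothetical helper: grace_seconds is how long the resource may stay
    without any attached client before it should be brought down.
    """

    def __init__(self, grace_seconds, now=None):
        self.grace_seconds = grace_seconds
        # Start the clock at creation so a freshly started resource gets
        # the full grace period to attract clients.
        self.last_seen = time.monotonic() if now is None else now

    def should_release(self, client_count, now=None):
        """Return True once no client has been seen for grace_seconds."""
        now = time.monotonic() if now is None else now
        if client_count > 0:
            self.last_seen = now  # clients attached: reset the timer
            return False
        return (now - self.last_seen) >= self.grace_seconds
```

Each monitor run would call should_release() with the current client
count; the explicit now argument is only there to make the logic easy to
test without waiting.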

-Thanks
Nikhil

On Tue, Jun 21, 2016 at 3:49 AM, Digimer <lists at alteeve.ca> wrote:

> On 20/06/16 05:58 PM, Dimitri Maziuk wrote:
> > On 06/20/2016 03:58 PM, Digimer wrote:
> >
> >> Then wouldn't it be a lot better to just run your services on both nodes
> >> all the time and take HA out of the picture? Availability is predicated
> >> on building the simplest system possible. If you have no concerns about
> >> uncoordinated access, then make life simpler and remove pacemaker
> >> entirely.
> >
> > Obviously you'd have to remove the other node as well since you now
> > can't have the single service access point anymore.
>
> Nikhil indicated that they could switch where traffic went up-stream
> without issue, if I understood properly.
>
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?