[ClusterLabs] Recovering after split-brain

Mon Jun 20 15:17:11 UTC 2016

On 2016-06-20 09:13, Jehan-Guillaume de Rorthais wrote:

> I've heard multiple time this kind of argument on the field, but soon or later,
> these clusters actually had a split brain scenario with clients connected on
> both side, some very bad corruptions, data lost, etc.

I'm sure it's a very helpful answer but the question was about 
suspending pacemaker while I manually fix a problem with the resource.

I too would very much like to know how to get pacemaker to "unmonitor" 
my resources and not get in the way while I'm updating and/or fixing them.

In heartbeat, mon was a completely separate component that could be moved 
out of the way when needed.

In pacemaker I have now had to power-cycle the nodes several times, because 
in a 2-node active/passive cluster without quorum and fencing, set up like
- drbd master-slave
- drbd filesystem (colocated and ordered after the master)
- symlink (colocated and ordered after the fs)
- service (colocated and ordered after the symlink)
-- when the service fails to start due to user error, pacemaker fscks up 
everything up to and including the master-slave drbd, and "clearing" 
errors on the service does not fix the symlink and the rest of it. (So 
far I've been unable to reliably reproduce it in testing environments; 
Murphy made sure it only happens on production clusters.)

Right now it seems to me that for a drbd split brain I'll have to stop the 
cluster on the victim node, do manual split-brain recovery, and restart the 
cluster after the sync is complete. Is that correct?
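
To make the sequence in question concrete, here is a minimal sketch of it, not a tested procedure. It assumes DRBD 8.4 command syntax, pcs as the cluster shell, and a DRBD resource named r0; all three are placeholders and none of them comes from the setup described above. The helper names (run, recover_victim, reconnect_survivor, rejoin_victim) are hypothetical. By default it only prints the commands; nothing is executed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only. RES is a hypothetical DRBD resource name.
RES="${RES:-r0}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = just print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# On the victim node (the one whose changes will be thrown away).
recover_victim() {
    # Leave the cluster so pacemaker no longer manages this node
    # (assumes pcs; other cluster shells use a different command).
    run pcs cluster stop
    # Stopping the cluster normally stops DRBD too, so bring it up by hand.
    run drbdadm up "$RES"
    run drbdadm disconnect "$RES"
    run drbdadm secondary "$RES"
    # Reconnect and discard the local modifications (DRBD 8.4 syntax).
    run drbdadm connect --discard-my-data "$RES"
}

# On the surviving node, only needed if it also went StandAlone.
reconnect_survivor() {
    run drbdadm connect "$RES"
}

# Back on the victim, once the resync has finished.
rejoin_victim() {
    # Check by hand that this reports UpToDate/UpToDate before going on.
    run drbdadm dstate "$RES"
    run pcs cluster start
}
```

Calling recover_victim on the victim, reconnect_survivor on the other node if needed, and rejoin_victim after the resync prints the commands in order, so the sequence can be reviewed before anything is run for real.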

Dimitri