[ClusterLabs] Recovering after split-brain
Dmitri Maziuk
dmitri.maziuk at gmail.com
Mon Jun 20 15:17:11 UTC 2016
On 2016-06-20 09:13, Jehan-Guillaume de Rorthais wrote:
> I've heard this kind of argument in the field many times, but sooner or later
> these clusters actually hit a split-brain scenario with clients connected on
> both sides, some very bad corruption, data loss, etc.
I'm sure it's a very helpful answer but the question was about
suspending pacemaker while I manually fix a problem with the resource.
I too would very much like to know how to get pacemaker to "unmonitor"
my resources and not get in the way while I'm updating and/or fixing them.
In heartbeat, mon was a completely separate component that could be moved
out of the way when needed.
With pacemaker I have now had to power-cycle the nodes several times, because
in a 2-node active/passive cluster without quorum or fencing, set up like
- drbd master-slave
- drbd filesystem (colocated and ordered after the master)
- symlink (colocated and ordered after the fs)
- service (colocated and ordered after the symlink)
-- when the service fails to start due to user error, pacemaker fscks up
everything up to and including the master-slave drbd and "clearing"
errors on the service does not fix the symlink and the rest of it. (So
far I've been unable to reliably reproduce it in testing environments;
Murphy made sure it only happens on production clusters.)
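For concreteness, the chain above could be written down roughly as in the
sketch below. It only prints crm shell style constraint lines for a stack
like mine; the resource ids (ms_drbd, p_fs, p_symlink, p_service) and the
constraint ids are made up for illustration and are not my actual
configuration. It also assumes the default symmetrical ordering, i.e. that
resources are stopped in the reverse of the order they are started in.

```python
from typing import List

# Hypothetical resource ids, bottom of the stack first.
CHAIN = ["ms_drbd", "p_fs", "p_symlink", "p_service"]


def crm_constraints(chain: List[str], master: str = "ms_drbd") -> List[str]:
    """Build crm shell style colocation/order lines for a linear stack.

    Each resource is colocated with, and ordered after, the one below it.
    The resource sitting directly on the master-slave set follows its
    Master role and starts only after the promote.
    """
    lines = []
    for below, above in zip(chain, chain[1:]):
        on_master = below == master
        with_rsc = f"{below}:Master" if on_master else below
        first = f"{below}:promote" if on_master else below
        then = f"{above}:start" if on_master else above
        lines.append(f"colocation col_{above}_with_{below} inf: {above} {with_rsc}")
        lines.append(f"order ord_{below}_before_{above} inf: {first} {then}")
    return lines


def stop_order(chain: List[str]) -> List[str]:
    """Stop order, assuming symmetrical ordering (reverse of start order)."""
    return list(reversed(chain))


if __name__ == "__main__":
    for line in crm_constraints(CHAIN):
        print(line)
    print("stop order:", " -> ".join(stop_order(CHAIN)))
```

The point of spelling it out is that the service sits on top of three other
resources, so anything the cluster decides to do about the service can reach
all the way down to the drbd master.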
Right now it seems to me that for a drbd split brain I'll have to stop the
cluster on the victim node, do manual split-brain recovery, and restart the
cluster after the sync is complete. Is that correct?
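To be explicit about the sequence I have in mind, here is a dry-run sketch
that only prints the steps and executes nothing. The assumptions are mine:
the resource name r0 is a placeholder, the stop/start commands assume a
systemd-managed pacemaker, the drbdadm syntax is the DRBD 8.4 form, and the
DRBD resource is assumed to still be configured on the victim after the
cluster is stopped (if stopping the cluster takes it down, it would have to
be brought up again first).

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (node to run it on, command line)


def split_brain_recovery_plan(
    resource: str = "r0",  # placeholder DRBD resource name
    stop_cluster: str = "systemctl stop pacemaker",    # assumption: systemd
    start_cluster: str = "systemctl start pacemaker",  # assumption: systemd
) -> List[Step]:
    """Return the manual recovery steps in order, without running them."""
    return [
        # Take the cluster manager out of the picture on the victim.
        ("victim", stop_cluster),
        # Drop the connection and make sure the victim is Secondary.
        ("victim", f"drbdadm disconnect {resource}"),
        ("victim", f"drbdadm secondary {resource}"),
        # Reconnect, throwing away the victim's changes (DRBD 8.4 syntax).
        ("victim", f"drbdadm connect --discard-my-data {resource}"),
        # Only needed if the survivor has dropped to StandAlone.
        ("survivor", f"drbdadm connect {resource}"),
        # Check repeatedly until both sides report UpToDate.
        ("victim", "cat /proc/drbd"),
        # Hand control back to the cluster once the sync is complete.
        ("victim", start_cluster),
    ]


if __name__ == "__main__":
    for node, command in split_brain_recovery_plan():
        print(f"[{node}] {command}")
```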
Dimitri