[Pacemaker] Split-brain after

Dan Frincu df.cluster at gmail.com
Mon Aug 15 04:41:14 EDT 2011

On Thu, Aug 11, 2011 at 8:12 PM, Digimer <linux at alteeve.com> wrote:
> On 08/11/2011 12:58 PM, Alex Forster wrote:
>> I have a two node Pacemaker/Corosync cluster with no resources configured yet.
>> I'm running RHEL 6.1 with the official 1.1.5-5.el6 package.
>> While doing various network configuration, I happened to notice that if I issue
>> a "service network restart" on one node, then approx. four seconds later issue
>> "service network restart" on the second node, the two nodes become split brain,
>> each thinking the other is offline.
>> Obviously, issuing 'service network restarts' four seconds apart will not be a
>> common occurrence in production, but it concerns me that I can 'trick' the nodes
>> into becoming split-brain so easily. Is there some way I can configure Corosync
>> to quickly recover from this scenario?

man corosync.conf
You can increase the value for rrp_problem_count_timeout for this.

              This specifies the time in milliseconds to wait before
              decrementing the problem count by 1 for a particular ring to
              ensure a link is not marked faulty for transient network
              failures.

              The default is 2000 milliseconds.

This, however, will cause issues further down the line, so you need to
take into consideration the timeouts your resources have, as well as
their monitor operations, so that they include the added time from modifying this


p.s.: don't mess with rrp_problem_count_threshold unless you also
consider that

    (rrp_problem_count_threshold * rrp_token_expired_timeout) < (token - 50ms)
    => (10 * 47) < (1000 - 50)
    => 470 < 950

(this is the default; changing rrp_problem_count_threshold to a higher
value would also mean changing the token timeout and/or other
parameters, so it would be best to plan
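
To make the arithmetic above easier to re-run with other values, here is a
minimal Python sketch of the same inequality. It only does the math; it does
not read corosync.conf or talk to corosync in any way. The function names
(rrp_threshold_ok, max_safe_threshold) are made up for illustration, and the
defaults are simply the numbers quoted in this message (10, 47, 1000, 50 ms).

```python
# Defaults as quoted in this message; times are in milliseconds.
DEFAULT_THRESHOLD = 10          # rrp_problem_count_threshold
DEFAULT_TOKEN_EXPIRED_MS = 47   # rrp_token_expired_timeout
DEFAULT_TOKEN_MS = 1000         # token
MARGIN_MS = 50                  # the "- 50ms" margin in the inequality


def rrp_threshold_ok(threshold=DEFAULT_THRESHOLD,
                     token_expired_ms=DEFAULT_TOKEN_EXPIRED_MS,
                     token_ms=DEFAULT_TOKEN_MS):
    """Hypothetical helper: True if
    threshold * token_expired_ms < token_ms - MARGIN_MS."""
    return threshold * token_expired_ms < token_ms - MARGIN_MS


def max_safe_threshold(token_expired_ms=DEFAULT_TOKEN_EXPIRED_MS,
                       token_ms=DEFAULT_TOKEN_MS):
    """Hypothetical helper: largest threshold that still satisfies the
    strict inequality for the given timeouts (0 if none does)."""
    budget = token_ms - MARGIN_MS
    if budget <= 0:
        return 0
    # Strict "<", so subtract 1 before the integer division.
    return (budget - 1) // token_expired_ms


if __name__ == "__main__":
    print(rrp_threshold_ok())               # defaults: 470 < 950
    print(rrp_threshold_ok(threshold=25))   # 1175 < 950 does not hold
    print(max_safe_threshold())
```

If you raise the threshold past what max_safe_threshold() reports, the
inequality no longer holds for the current token value, which is the case
where the token timeout would have to be raised as well.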

>> Alex
> Configuring fence (stonith) will protect against split-brain by causing
> the remote node to be forced offline (rough, but better than split-brain).
> --
> Digimer
> E-Mail:              digimer at alteeve.com
> Freenode handle:     digimer
> Papers and Projects: http://alteeve.com
> Node Assassin:       http://nodeassassin.org
> "At what point did we forget that the Space Shuttle was, essentially,
> a program that strapped human beings to an explosion and tried to stab
> through the sky with fire and math?"

Dan Frincu
