[Pacemaker] How to avoid or automatically resolve Split-Brain issue of DRBD

Digimer lists at alteeve.ca
Wed Aug 28 13:55:20 EDT 2013

On 28/08/13 13:13, Xiaomin Zhang wrote:
> Hi, Gurus:
> I've a simple master-slave setup for a mirrored DRBD storage: This
> storage is written by a daemon Java application server to produce
> transaction data.
> node Lhs072gkz \
>          attributes standby="on"
> node Lpplj9jb4
> node Lvoim0kaw
> primitive drbd1 ocf:linbit:drbd \
>          params drbd_resource="r0" \
>          op monitor interval="15s"
> ms ms_drbd1 drbd1 \
>          meta master-max="1" master-node-max="1" clone-max="2" \
>          clone-node-max="1" notify="true" target-role="Started"
> location drbd-fence-by-handler-ms_drbd1 ms_drbd1 \
>          rule $id="drbd-fence-by-handler-rule-ms_drbd1" $role="Master" \
>          -inf: #uname ne Lpplj9jb4
> It seems a split-brain is very likely to happen when I reboot the slave
> machine, even when the Java application is writing nothing to the DRBD
> storage.
> Is this expected behavior?
> And I found some topics about automatically recovering from split-brain
> for DRBD (). They just say to put some configuration in DRBD and
> everything should work. Is this a good practice?
> Thanks.

No, split-brains are not at all expected behaviour, but they happen when
things are not set up properly.

The best thing to do is to avoid a split-brain in the first place, which
is easy to do if you set up (working) stonith/fencing.

If you configure stonith in pacemaker using IPMI (the most common
method) and test it to make sure nodes reboot on failure, you can then
"hook" drbd into pacemaker's fencing. You do this by setting DRBD's
fencing policy to "resource-and-stonith" and then telling DRBD to use
the "crm-fence-peer.sh" fence handler.

This tells DRBD to block IO and call a fence if the peer fails (or
vanishes). The fence handler is then invoked; it calls pacemaker and
says "please fence node X". When pacemaker succeeds, it tells the
handler, which in turn tells DRBD that it's now safe to resume IO.
One of the nodes will be dead, so you avoid the split-brain in the
first place.
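
As a rough sketch of the DRBD side of that hook, the resource
definition could look something like the fragment below. It assumes
DRBD 8.4 syntax, reuses the resource name "r0" from your configuration,
and assumes the handler scripts live in /usr/lib/drbd/ (the path can
differ between distributions). The unfence handler is my addition; it
is the usual companion that removes the constraint once the peer has
resynced.

```
# Sketch only: merge into your existing "r0" resource definition.
resource r0 {
    disk {
        # Block IO and call the fence handler when the peer is lost.
        fencing resource-and-stonith;
    }
    handlers {
        # Assumed install path; check where your packages put these.
        fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
}
```

On other DRBD versions the "fencing" option may belong in a different
section, so check the documentation for the version you run.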

If your servers have IPMI, iLO, iDRAC, RSA, etc., you can use the
'fence_ipmilan' fence agent in your pacemaker configuration. If you need
help with this, just ask.
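
To give an idea of what that looks like, here is a minimal crm shell
sketch for one node. The node name is taken from your configuration;
the primitive and constraint names, BMC address, login and password
are placeholders I made up, and the parameter names are the ones
fence_ipmilan used at the time (newer releases rename some of them).
You would add a matching primitive for the other DRBD node.

```
# Hypothetical fence device for node Lpplj9jb4; address and
# credentials below are placeholders, not real values.
primitive fence_Lpplj9jb4 stonith:fence_ipmilan \
        params pcmk_host_list="Lpplj9jb4" ipaddr="192.0.2.11" \
               login="admin" passwd="secret" \
        op monitor interval="60s"
# Keep the fence device off the node it is meant to kill.
location loc_fence_Lpplj9jb4 fence_Lpplj9jb4 -inf: Lpplj9jb4
property stonith-enabled="true"
```

Before trusting it, test it by fencing each node on purpose and
confirming that the node really reboots.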



Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?
