[Pacemaker] ifdown ethX + corosync + DRBD = split-brain?

Fri Jul 19 11:38:39 UTC 2013

Hi,

I have been testing a fairly standard pacemaker/corosync setup with DRBD (with resource-level fencing) and have noticed the following when testing network failures:

- Handling of all ports being blocked is OK, based on hundreds of tests.
- Handling of cable-pulls seems OK, based on only 10 tests.
- ifdown ethX leads to split-brain roughly 50% of the time due to two underlying issues:

1. corosync (possibly by design) handles the loss of a network interface differently from other network failures. I can only see this in the logs: after ifdown I get "[TOTEM ] The network interface is down.", whereas I don't see that message after a cable-pull. This is a guess, as I don't know the code.
2. corosync allows a non-quorate partition, in my case a single node, to update the CIB. This behaviour was confirmed in replies to earlier mails on this list, where it was also mentioned that there may be improvements in this area in the future. On its own, this seems like a bug to me.
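
For reference, here is a minimal sketch of the majority rule behind the term "non-quorate". It assumes one vote per node and a plain strict-majority calculation. The function name has_quorum is made up for illustration; this is not corosync or Pacemaker code.

```python
def has_quorum(partition_votes: int, expected_votes: int) -> bool:
    """Strict majority: a partition is quorate only if it holds
    more than half of the expected votes (assumes one vote per node)."""
    return 2 * partition_votes > expected_votes


# Hypothetical two-node cluster: an isolated node holds 1 of 2 votes.
print(has_quorum(1, 2))  # False
# Hypothetical three-node cluster: two nodes that still see each other.
print(has_quorum(2, 3))  # True
```

Under this rule a single isolated node is never quorate in a cluster of two or more nodes, which is why I would expect it not to be allowed to change the CIB.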

My question is: is it possible for me to configure corosync/drbd to handle the ifdown scenario, or do I simply have to tell people "do not test with ifdown", as I have seen mentioned in a few places on the web? If I do have to leave out ifdown testing, how can I be sure that I haven't missed testing some real network failure scenario?

I don't have the time to do hundreds of cable-pulls, which is what I'm trying to simulate. I will look into introducing failures via the switch, but ideally I'd like to be able to handle ifdown properly or have a clear answer to my problem.
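
A minimal sketch of how a "block all traffic" failure could be scripted for repeated runs, as one alternative to ifdown. It assumes iptables is available and that the cluster link is a hypothetical interface eth1; the helper names block_rules and apply_rules are made up for illustration. It is not necessarily how the port-blocking tests above were run.

```python
import subprocess
from typing import List


def block_rules(iface: str, action: str = "-A") -> List[List[str]]:
    """Build iptables commands that drop all traffic in and out of iface.

    action is "-A" to append the rules (start the outage)
    or "-D" to delete them again (end the outage).
    """
    if action not in ("-A", "-D"):
        raise ValueError("action must be '-A' or '-D'")
    return [
        ["iptables", action, "INPUT", "-i", iface, "-j", "DROP"],
        ["iptables", action, "OUTPUT", "-o", iface, "-j", "DROP"],
    ]


def apply_rules(rules: List[List[str]], dry_run: bool = True) -> None:
    """Print the commands (dry run) or execute them (needs root)."""
    for cmd in rules:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "eth1" is an assumed interface name; dry run only prints the commands.
    apply_rules(block_rules("eth1", "-A"))  # start the simulated outage
    apply_rules(block_rules("eth1", "-D"))  # end the simulated outage
```

Unlike ifdown, this leaves the interface and its address in place, so corosync should only see the traffic stop.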

I would really appreciate advice on this as it's a serious issue for me.

Thanks,
Tom