[ClusterLabs] Network split during corosync startup results in split brain
Christine Caulfield
ccaulfie at redhat.com
Mon Jul 20 17:36:58 CEST 2015
That's very interesting, and worrying.
Can you send me the full logs, please (just the corosync ones if they're
separated; I don't think pacemaker is involved here)? If you still have
one node in that state (or can reproduce it), then the output of
corosync-cmapctl on both nodes would also be helpful.
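
For example, something like this on each node should capture what I need
(corosync-cmapctl with no arguments dumps the whole keystore; the output
file name is just a suggestion):

  corosync-cmapctl > /tmp/cmap-$(hostname).txt
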
Chrissie
On 20/07/15 16:29, Thomas Meagher wrote:
> Hello,
>
> Our team has been using corosync + pacemaker successfully for the last
> year or two, but last week ran into an issue which I wanted to get some
> more insight on. We have a 2-node cluster using the WaitForAll
> votequorum parameter, so all nodes must have been seen at least once
> before resources are started. We have two layers of fencing configured:
> IPMI and SBD (storage-based death, using shared storage). We have done
> extensive testing on our fencing in the past and it works great, but
> here the fencing never got called. One of our QA testers managed to
> pull the network cable at a very particular time during startup, and it
> seems to have resulted in corosync telling pacemaker that all nodes had
> been seen, and that the cluster was in a normal state with one node up.
> No fencing was ever triggered, and all resources were started normally.
> The other node was NOT marked unclean. This resulted in a split brain
> scenario, as our master database (pgsql replication) was still running
> as master on the other node, and had now been started and promoted on
> this node. Luckily this is all in a test environment, so no production
> impact was seen. Below are the test specifics and some relevant logs.
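>
> For reference, our quorum configuration is essentially the stock
> two-node votequorum setup; a minimal sketch of the relevant
> corosync.conf section (nodelist and totem details omitted) looks like
> this:
>
>   quorum {
>       provider: corosync_votequorum
>       expected_votes: 2
>       two_node: 1
>       wait_for_all: 1
>   }
>
> (With two_node: 1, wait_for_all is enabled by default anyway; we set it
> explicitly.)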
>
> Procedure:
> 1. Allow both nodes to come up fully.
> 2. Reboot current master node.
> 3. As node is booting up again (during corosync startup), pull
> interconnect cable.
>
>
> Expected Behavior:
> 1. Node either a) fails to start any resources, or b) fences the other
> node and promotes to master.
>
>
> Actual behavior:
> 1. Node promotes to master without fencing its peer, resulting in both
> nodes running the master database.
>
>
> Module-2 is rebooted at 12:57:42 and comes back up at around 12:59.
> When corosync starts up, both nodes are visible and all vote counts are
> normal.
>
> Jul 15 12:59:00 module-2 corosync[2906]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
> Jul 15 12:59:00 module-2 corosync[2906]: [TOTEM ] A new membership (10.1.1.2:56) was formed. Members joined: 2
> Jul 15 12:59:00 module-2 corosync[2906]: [QUORUM] Waiting for all cluster members. Current votes: 1 expected_votes: 2
> Jul 15 12:59:00 module-2 corosync[2906]: [QUORUM] Members[1]: 2
> Jul 15 12:59:00 module-2 corosync[2906]: [MAIN ] Completed service synchronization, ready to provide service.
> Jul 15 12:59:06 module-2 pacemakerd[4076]: notice: cluster_connect_quorum: Quorum acquired
>
> 3 seconds later, the interconnect network cable is pulled.
>
> Jul 15 12:59:09 module-2 kernel: e1000e: eth3 NIC Link is Down
>
> The cluster stack recognizes this immediately, and crmd declares the
> peer (the DC, module-1) dead.
>
> Jul 15 12:59:10 module-2 crmd[4107]: notice: peer_update_callback: Our peer on the DC (module-1) is dead
>
> Slightly later (almost simultaneously), corosync completes its
> initialization, reports that it has quorum, and declares the system
> ready for use.
>
> Jul 15 12:59:10 module-2 corosync[2906]: [QUORUM] Members[1]: 2
> Jul 15 12:59:10 module-2 corosync[2906]: [MAIN ] Completed service synchronization, ready to provide service.
>
> Pacemaker starts resources normally, including Postgres.
>
> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start fence_sbd (module-2)
> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start ipmi-1 (module-2)
> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start SlaveIP (module-2)
> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start postgres:0 (module-2)
> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start ethmonitor:0 (module-2)
> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start tomcat-instance:0 (module-2 - blocked)
> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start ClusterMonitor:0 (module-2 - blocked)
>
> Votequorum shows 1 vote per node and WaitForAll is set. Pacemaker
> should not be able to start ANY resources until all nodes have been
> seen at least once.
>
> module-2 ~ # corosync-quorumtool
> Quorum information
> ------------------
> Date: Wed Jul 15 18:15:34 2015
> Quorum provider: corosync_votequorum
> Nodes: 1
> Node ID: 2
> Ring ID: 64
> Quorate: Yes
>
> Votequorum information
> ----------------------
> Expected votes: 2
> Highest expected: 2
> Total votes: 1
> Quorum: 1
> Flags: 2Node Quorate WaitForAll
>
> Membership information
> ----------------------
>     Nodeid      Votes Name
>          2          1 module-2 (local)
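>
> (For completeness, the same state can be cross-checked against the cmap
> keystore; a simple grep over the full corosync-cmapctl dump, as below,
> should show the configured and runtime votequorum keys. Exact key names
> may differ between corosync versions.)
>
>   corosync-cmapctl | grep -i quorum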
>
>
> Package versions:
>
> -bash-4.3# rpm -qa | grep corosync
> corosynclib-2.3.4-1.fc22.x86_64
> corosync-2.3.4-1.fc22.x86_64
>
> -bash-4.3# rpm -qa | grep pacemaker
> pacemaker-cluster-libs-1.1.12-2.fc22.x86_64
> pacemaker-libs-1.1.12-2.fc22.x86_64
> pacemaker-cli-1.1.12-2.fc22.x86_64
> pacemaker-1.1.12-2.fc22.x86_64
>