[ClusterLabs] Network split during corosync startup results in split brain
Christine Caulfield
ccaulfie at redhat.com
Mon Jul 20 17:36:58 CEST 2015
That's very interesting, and worrying.
Can you send me the full logs, please (just the corosync ones if they're
separated; I don't think pacemaker is involved here)? If you still have
one node in that state (or can reproduce it), then the output of
corosync-cmapctl on both nodes would also be helpful.
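
For example, something like this on each node should capture what I need
(corosync-cmapctl with no arguments dumps the whole keystore; the output
file name is just a suggestion):

  corosync-cmapctl > /tmp/cmap-$(hostname).txt
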
Chrissie
On 20/07/15 16:29, Thomas Meagher wrote:
> Hello,
>
> Our team has been using corosync + pacemaker successfully for the last
> year or two, but last week ran into an issue which I wanted to get some
> more insight on. We have a 2-node cluster using the WaitForAll
> votequorum parameter, so all nodes must have been seen at least once
> before resources are started. We have two layers of fencing configured:
> IPMI and SBD (storage-based death, using shared storage). We have done
> extensive testing on our fencing in the past and it works great, but
> here the fencing never got called. One of our QA testers managed to
> pull the network cable at a very particular time during startup, and it
> seems to have resulted in corosync telling pacemaker that all nodes had
> been seen, and that the cluster was in a normal state with one node up.
> No fencing was ever triggered, and all resources were started normally.
> The other node was NOT marked unclean. This resulted in a split brain
> scenario, as our master database (pgsql replication) was still running
> as master on the other node, and had now been started and promoted on
> this node. Luckily this is all in a test environment, so no production
> impact was seen. Below are the test specifics and some relevant logs.
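>
> For reference, our quorum configuration is essentially the stock
> two-node votequorum setup; a minimal sketch of the relevant
> corosync.conf section (nodelist and totem details omitted) looks like
> this:
>
>   quorum {
>       provider: corosync_votequorum
>       expected_votes: 2
>       two_node: 1
>       wait_for_all: 1
>   }
>
> (With two_node: 1, wait_for_all is enabled by default anyway; we set it
> explicitly.)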
>
> Procedure:
> 1. Allow both nodes to come up fully.
> 2. Reboot current master node.
> 3. As node is booting up again (during corosync startup), pull
> interconnect cable.
>
>
> Expected Behavior:
> 1. Node either a) fails to start any resources, or b) fences the other
> node and promotes to master.
>
>
> Actual behavior:
> 1. Node promotes to master without fencing its peer, resulting in both
> nodes running the master database.
>
>
> Module-2 is rebooted at 12:57:42 and comes back up at around 12:59.
> When corosync starts up, both nodes are visible and all vote counts are
> normal.
>
> Jul 15 12:59:00 module-2 corosync[2906]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
> Jul 15 12:59:00 module-2 corosync[2906]: [TOTEM ] A new membership (10.1.1.2:56) was formed. Members joined: 2
> Jul 15 12:59:00 module-2 corosync[2906]: [QUORUM] Waiting for all cluster members. Current votes: 1 expected_votes: 2
> Jul 15 12:59:00 module-2 corosync[2906]: [QUORUM] Members[1]: 2
> Jul 15 12:59:00 module-2 corosync[2906]: [MAIN ] Completed service synchronization, ready to provide service.
> Jul 15 12:59:06 module-2 pacemakerd[4076]: notice: cluster_connect_quorum: Quorum acquired
>
> 3 seconds later, the interconnect network cable is pulled.
>
> Jul 15 12:59:09 module-2 kernel: e1000e: eth3 NIC Link is Down
>
> The cluster stack recognizes this immediately, and crmd declares the
> peer (the DC, module-1) dead.
>
> Jul 15 12:59:10 module-2 crmd[4107]: notice: peer_update_callback: Our peer on the DC (module-1) is dead
>
> Slightly later (almost simultaneously), corosync completes its
> initialization, reports that it has quorum, and declares the system
> ready for use.
>
> Jul 15 12:59:10 module-2 corosync[2906]: [QUORUM] Members[1]: 2
> Jul 15 12:59:10 module-2 corosync[2906]: [MAIN ] Completed service synchronization, ready to provide service.
>
> Pacemaker starts resources normally, including Postgres.
>
> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start fence_sbd (module-2)
> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start ipmi-1 (module-2)
> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start SlaveIP (module-2)
> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start postgres:0 (module-2)
> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start ethmonitor:0 (module-2)
> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start tomcat-instance:0 (module-2 - blocked)
> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start ClusterMonitor:0 (module-2 - blocked)
>
> Votequorum shows 1 vote per node and WaitForAll is set. Pacemaker
> should not be able to start ANY resources until all nodes have been
> seen at least once.
>
> module-2 ~ # corosync-quorumtool
> Quorum information
> ------------------
> Date: Wed Jul 15 18:15:34 2015
> Quorum provider: corosync_votequorum
> Nodes: 1
> Node ID: 2
> Ring ID: 64
> Quorate: Yes
>
> Votequorum information
> ----------------------
> Expected votes: 2
> Highest expected: 2
> Total votes: 1
> Quorum: 1
> Flags: 2Node Quorate WaitForAll
>
> Membership information
> ----------------------
>     Nodeid      Votes Name
>          2          1 module-2 (local)
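>
> (For completeness, the same state can be cross-checked against the cmap
> keystore; a simple grep over the full corosync-cmapctl dump, as below,
> should show the configured and runtime votequorum keys. Exact key names
> may differ between corosync versions.)
>
>   corosync-cmapctl | grep -i quorum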
>
>
> Package versions:
>
> -bash-4.3# rpm -qa | grep corosync
> corosynclib-2.3.4-1.fc22.x86_64
> corosync-2.3.4-1.fc22.x86_64
>
> -bash-4.3# rpm -qa | grep pacemaker
> pacemaker-cluster-libs-1.1.12-2.fc22.x86_64
> pacemaker-libs-1.1.12-2.fc22.x86_64
> pacemaker-cli-1.1.12-2.fc22.x86_64
> pacemaker-1.1.12-2.fc22.x86_64
>