[ClusterLabs] Network split during corosync startup results in split brain

Tue Jul 28 17:56:56 CEST 2015

We have written an automated test that was able to reproduce this issue, with corosync debug logging.
I once again have two nodes with a split network and both nodes are running even though no fencing was ever initiated.

1.  Bring both nodes up.
2.  'echo b > /proc/sysrq-trigger' to one node (hard reboot)
3.  Wait for node to come back up.
4.  When 'Quorum' first appears in journalctl, wait some period of time between 0.5 - 1.5 seconds, then drop network via 'ifdown ethX'

>From what I see in the logs below, initially the network fault has been detected and Totem says it is unable to form a cluster.
But then I see that crmd "Could not recover from internal error"
A bunch of interrupted system calls, and what looks like a restart of crmd.
Then we actually "Quorum acquired" (right in the middle of all this bad stuff going on!) and "The local CRM is operational"
Then more terminate and recovery operations along with "Fast-tracking shutdown in response to errors", but the reality is that corosync is STILL running on this node.
I even have the system in this state currently.  To me, it appears corosync was forked or restarted and there is some kind of race here.

Relevant logs:

Jul 27 09:07:19 d24-22-right.lab.archivas.com corosync[4629]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: feature_update_callback: Update failed: Timer expired (-62)
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:    error: do_log: FSA: Input I_ERROR from feature_update_callback() received in state S_INTEGRATION
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:  warning: do_state_transition: State transition S_INTEGRATION -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=feature_update_callback ]
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:  warning: do_recover: Fast-tracking shutdown in response to errors
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:  warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:    error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: lrm_state_verify_stopped: Stopped 0 recurring operations at shutdown (10 ops remaining)
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: lrm_state_verify_stopped: Recurring action tomcat-instance:695 (tomcat-instance_monitor_10000) incomplete at shutdown
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: lrm_state_verify_stopped: Recurring action postgres:723 (postgres_monitor_5000) incomplete at shutdown
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: lrm_state_verify_stopped: Recurring action fence_sbd:56 (fence_sbd_monitor_10000) incomplete at shutdown
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: lrm_state_verify_stopped: Recurring action MasterIP:724 (MasterIP_monitor_10000) incomplete at shutdown
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: lrm_state_verify_stopped: Recurring action SlaveIP:58 (SlaveIP_monitor_10000) incomplete at shutdown
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: lrm_state_verify_stopped: Recurring action ethmonitor:60 (ethmonitor_monitor_10000) incomplete at shutdown
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: lrm_state_verify_stopped: Recurring action N1F1:718 (N1F1_monitor_10000) incomplete at shutdown
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: lrm_state_verify_stopped: Recurring action N2F1:63 (N2F1_monitor_10000) incomplete at shutdown
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: lrm_state_verify_stopped: Recurring action ipmi-22:57 (ipmi-22_monitor_10000) incomplete at shutdown
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: lrm_state_verify_stopped: Recurring action ClusterMonitor:700 (ClusterMonitor_monitor_10000) incomplete at shutdown
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:    error: lrm_state_verify_stopped: 10 resources were active at shutdown.
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: do_lrm_control: Disconnected from the LRM
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: terminate_cs_connection: Disconnecting from Corosync
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:    error: crmd_fast_exit: Could not recover from internal error
Jul 27 09:07:20 d24-22-right.lab.archivas.com cib[4697]:  warning: qb_ipcs_event_sendv: new_event_notification (4697-4702-11): Broken pipe (32)
Jul 27 09:07:20 d24-22-right.lab.archivas.com cib[4697]:  warning: do_local_notify: A-Sync reply to crmd failed: No message of desired type
Jul 27 09:07:20 d24-22-right.lab.archivas.com pacemakerd[4671]:    error: pcmk_child_exit: The crmd process (4702) exited: Generic Pacemaker error (201)
Jul 27 09:07:20 d24-22-right.lab.archivas.com pacemakerd[4671]:   notice: pcmk_process_exit: Respawning failed child process: crmd
Jul 27 09:07:20 d24-22-right.lab.archivas.com pacemakerd[4671]: child_waitpid: wait(4697) = 0: Interrupted system call (4)
Jul 27 09:07:20 d24-22-right.lab.archivas.com pacemakerd[4671]: child_waitpid: wait(4698) = 0: Interrupted system call (4)
Jul 27 09:07:20 d24-22-right.lab.archivas.com pacemakerd[4671]: child_waitpid: wait(4699) = 0: Interrupted system call (4)
Jul 27 09:07:20 d24-22-right.lab.archivas.com pacemakerd[4671]: child_waitpid: wait(4700) = 0: Interrupted system call (4)
Jul 27 09:07:20 d24-22-right.lab.archivas.com pacemakerd[4671]: child_waitpid: wait(4701) = 0: Interrupted system call (4)
Jul 27 09:07:20 d24-22-right.lab.archivas.com pacemakerd[4671]: child_waitpid: wait(7224) = 0: Success (0)
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[7224]:   notice: main: CRM Git Version: a9c8177
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[7224]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[7224]:   notice: get_node_name: Could not obtain a node name for corosync nodeid 2
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[7224]:   notice: get_node_name: Defaulting to uname -n for the local corosync node name
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[7224]:   notice: cluster_connect_quorum: Quorum acquired
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[7224]:   notice: config_query_callback: Watchdog enabled but stonith-watchdog-timeout is disabled
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[7224]:   notice: get_node_name: Could not obtain a node name for corosync nodeid 1
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[7224]:   notice: get_node_name: Could not obtain a node name for corosync nodeid 1
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[7224]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[1] - state is now member (was (null))
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[7224]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node d24-22-right.lab.archivas.com[2] - state is now member (was (null))
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[7224]:   notice: get_node_name: Defaulting to uname -n for the local corosync node name
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[7224]:   notice: do_started: The local CRM is operational
Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[7224]:   notice: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
Jul 27 09:07:21 d24-22-right.lab.archivas.com corosync[4629]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local
Jul 27 09:07:22 d24-22-right.lab.archivas.com corosync[4629]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local
Jul 27 09:07:24 d24-22-right.lab.archivas.com corosync[4629]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local

Jul 27 09:07:30 d24-22-right.lab.archivas.com crmd[7224]:    error: node_list_update_callback: Node update 3 failed: Timer expired (-62)
Jul 27 09:07:30 d24-22-right.lab.archivas.com crmd[7224]:    error: do_log: FSA: Input I_ERROR from node_list_update_callback() received in state S_PENDING
Jul 27 09:07:30 d24-22-right.lab.archivas.com crmd[7224]:  warning: do_state_transition: State transition S_PENDING -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=node_list_update_callback ]
Jul 27 09:07:30 d24-22-right.lab.archivas.com crmd[7224]:  warning: do_recover: Fast-tracking shutdown in response to errors
Jul 27 09:07:30 d24-22-right.lab.archivas.com crmd[7224]:    error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Jul 27 09:07:30 d24-22-right.lab.archivas.com crmd[7224]:   notice: lrm_state_verify_stopped: Stopped 0 recurring operations at shutdown (0 ops remaining)
Jul 27 09:07:30 d24-22-right.lab.archivas.com crmd[7224]:   notice: do_lrm_control: Disconnected from the LRM
Jul 27 09:07:30 d24-22-right.lab.archivas.com crmd[7224]:   notice: terminate_cs_connection: Disconnecting from Corosync

Despite the fact that the last log message here says "Disconnecting from Corosync", everything is up and running and the other node is marked as pending.

Node d24-22-left.lab.archivas.com (1): pending
Online: [ d24-22-right.lab.archivas.com ]

 fence_sbd	(stonith:fence_sbd):	Started d24-22-right.lab.archivas.com 
 ipmi-22	(stonith:fence_ipmilan):	Started d24-22-right.lab.archivas.com 
 Resource Group: DBMaster
     MasterIP	(ocf::heartbeat:IPaddr2-rhino):	Started d24-22-right.lab.archivas.com 
 Resource Group: DBSlave
     SlaveIP	(ocf::heartbeat:IPaddr2-rhino):	Started d24-22-right.lab.archivas.com 
 Master/Slave Set: postgres-ms [postgres]
     Masters: [ d24-22-right.lab.archivas.com ]
     Stopped: [ d24-22-left.lab.archivas.com ]
 Clone Set: ethmonitor-clone [ethmonitor]
     Started: [ d24-22-right.lab.archivas.com ]
     Stopped: [ d24-22-left.lab.archivas.com ]
 Clone Set: tomcat-clone [tomcat-group]
     Started: [ d24-22-right.lab.archivas.com ]
     Stopped: [ d24-22-left.lab.archivas.com ]
 Resource Group: Node1-IPs
     N1F1	(ocf::heartbeat:IPaddr2-rhino):	Started d24-22-right.lab.archivas.com 
 Resource Group: Node2-IPs
     N2F1	(ocf::heartbeat:IPaddr2-rhino):	Started d24-22-right.lab.archivas.com 

________________________________________
From: Christine Caulfield [ccaulfie at redhat.com]
Sent: Monday, July 20, 2015 11:36
To: users at clusterlabs.org
Subject: Re: [ClusterLabs] Network split during corosync startup results in split brain

That's very interesting, and worrying.

Can you send me the full logs please (just the corosync ones if they're
separated, I don't think pacemaker is involved here). If you still have
one node in that state (or can reproduce it) then the output of
corosync-cmapctl on both nodes would also he helpful.

Chrissie

On 20/07/15 16:29, Thomas Meagher wrote:
> Hello,
>
> Our team has been using corosync + pacemaker successfully for the last
> year or two, but last week ran into an issue which I wanted to get some
> more insight on.  We have a 2 node cluster, using the WaitForAll
> votequorum parameter so all nodes must have been seen at least once
> before resources are started.  We have two layers of fencing configured,
> IPMI and SBD (storage based death, using shared storage).  We have done
> extensive testing on our fencing in the past and it works great, but
> here the fencing never got called.  One of our QA testers managed to
> pull the network cable at a very particular time during startup, and it
> seems to have resulted in corosync telling pacemaker that all nodes had
> been seen, and that the cluster was in a normal state with one node up.
> No fencing was ever triggered, and all resources were started normally.
> The other node was NOT marked unclean.  This resulted in a split brain
> scenario, as our master database (pgsql replication) was still running
> as master on the other node, and had now been started and promoted on
> this node.  Luckily this is all in a test environment, so no production
> impact was seen.  Below is test specifics and some relevant logs.
>
> Procedure:
> 1. Allow both nodes to come up fully.
> 2. Reboot current master node.
> 3. As node is booting up again (during corosync startup), pull
> interconnect cable.
>
>
> Expected Behavior:
> 1. Node either a) fails to start any resources or b) fences other node
> and promotes to master
>
>
> Actual behavior:
> 1. Node promotes to master without fencing peer, resulting in both nodes
> running master database.
>
>
> Module-2 is rebooted @ 12:57:42, and comes back up ~12:59.
> When corosync starts up, both nodes are visible and all vote counts are
> normal.
>
> Jul 15 12:59:00 module-2 corosync[2906]: [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
> Jul 15 12:59:00 module-2 corosync[2906]: [TOTEM ] A new membership (10.1.1.2:56) was formed. Members joined: 2
> Jul 15 12:59:00 module-2 corosync[2906]: [QUORUM] Waiting for all cluster members. Current votes: 1 expected_votes: 2
> Jul 15 12:59:00 module-2 corosync[2906]: [QUORUM] Members[1]: 2
> Jul 15 12:59:00 module-2 corosync[2906]: [MAIN  ] Completed service synchronization, ready to provide service.
> Jul 15 12:59:06 module-2 pacemakerd[4076]: notice: cluster_connect_quorum: Quorum acquired
>
> 3 seconds later, the interconnect network cable is pulled.
>
> Jul 15 12:59:09 module-2 kernel: e1000e: eth3 NIC Link is Down
>
> Corosync recognizes this immediately, and declares the peer as dead.
>
> Jul 15 12:59:10 module-2 crmd[4107]: notice: peer_update_callback: Our peer on the DC (module-1) is dead
>
> Slightly later (very close), corosync initialization completes, it says
> it has quorum, and declares system ready for use.
>
> Jul 15 12:59:10 module-2 corosync[2906]: [QUORUM] Members[1]: 2
> Jul 15 12:59:10 module-2 corosync[2906]: [MAIN  ] Completed service synchronization, ready to provide service.
>
> Pacemaker starts resources normally, including Postgres.
>
> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start   fence_sbd        (module-2)
> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start   ipmi-1        (module-2)
> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start   SlaveIP        (module-2)
> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start   postgres:0        (module-2)
> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start   ethmonitor:0        (module-2)
> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start   tomcat-instance:0        (module-2 - blocked)
> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start   ClusterMonitor:0        (module-2 - blocked)
>
> Votequorum shows 1 vote per node, WaitForAll is set. Pacemaker should
> not be able to start ANY resources until it has seen all nodes once.
>
> module-2 ~ # corosync-quorumtool
> Quorum information
> ------------------
> Date:             Wed Jul 15 18:15:34 2015
> Quorum provider:  corosync_votequorum
> Nodes:            1
> Node ID:          2
> Ring ID:          64
> Quorate:          Yes
>
> Votequorum information
> ----------------------
> Expected votes:   2
> Highest expected: 2
> Total votes:      1
> Quorum:           1
> Flags:            2Node Quorate WaitForAll
>
> Membership information
> ----------------------
>     Nodeid      Votes Name
>          2          1 module-2 (local)
>
>
> Package versions:
>
> -bash-4.3# rpm -qa | grep corosync
> corosynclib-2.3.4-1.fc22.x86_64
> corosync-2.3.4-1.fc22.x86_64
>
> -bash-4.3# rpm -qa | grep pacemaker
> pacemaker-cluster-libs-1.1.12-2.fc22.x86_64
> pacemaker-libs-1.1.12-2.fc22.x86_64
> pacemaker-cli-1.1.12-2.fc22.x86_64
> pacemaker-1.1.12-2.fc22.x86_64
>
>
>
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://urldefense.proofpoint.com/v2/url?u=http-3A__clusterlabs.org_mailman_listinfo_users&d=BQICAg&c=DZ-EF4pZfxGSU6MfABwx0g&r=bis4VSCZavhQtfjFuCXVlC6MY8OhonJT1cJm04Er43o&m=raEcoUq946-7Z8PdhsP-o2P28CE2TWEV-mrlLScgezk&s=K55TadxUiZxoM-E_nnrhwefXHKXaIfHh426JWg_PYA8&e=
>
> Project Home: https://urldefense.proofpoint.com/v2/url?u=http-3A__www.clusterlabs.org&d=BQICAg&c=DZ-EF4pZfxGSU6MfABwx0g&r=bis4VSCZavhQtfjFuCXVlC6MY8OhonJT1cJm04Er43o&m=raEcoUq946-7Z8PdhsP-o2P28CE2TWEV-mrlLScgezk&s=veITldAPOQm8Inp4w2UJqhAr-78ZWGYwVx55pCOd5FA&e=
> Getting started: https://urldefense.proofpoint.com/v2/url?u=http-3A__www.clusterlabs.org_doc_Cluster-5Ffrom-5FScratch.pdf&d=BQICAg&c=DZ-EF4pZfxGSU6MfABwx0g&r=bis4VSCZavhQtfjFuCXVlC6MY8OhonJT1cJm04Er43o&m=raEcoUq946-7Z8PdhsP-o2P28CE2TWEV-mrlLScgezk&s=7qFi2U4ok3zqu5221aJ57gdG3O6YByLgxPiiul2Ljn4&e=
> Bugs: https://urldefense.proofpoint.com/v2/url?u=http-3A__bugs.clusterlabs.org&d=BQICAg&c=DZ-EF4pZfxGSU6MfABwx0g&r=bis4VSCZavhQtfjFuCXVlC6MY8OhonJT1cJm04Er43o&m=raEcoUq946-7Z8PdhsP-o2P28CE2TWEV-mrlLScgezk&s=T6D0JDGaqyVXMLHSb0SK3kHz9tCUPpw4zVPOdqc9Tmw&e=
>

_______________________________________________
Users mailing list: Users at clusterlabs.org
https://urldefense.proofpoint.com/v2/url?u=http-3A__clusterlabs.org_mailman_listinfo_users&d=BQICAg&c=DZ-EF4pZfxGSU6MfABwx0g&r=bis4VSCZavhQtfjFuCXVlC6MY8OhonJT1cJm04Er43o&m=raEcoUq946-7Z8PdhsP-o2P28CE2TWEV-mrlLScgezk&s=K55TadxUiZxoM-E_nnrhwefXHKXaIfHh426JWg_PYA8&e=

Project Home: https://urldefense.proofpoint.com/v2/url?u=http-3A__www.clusterlabs.org&d=BQICAg&c=DZ-EF4pZfxGSU6MfABwx0g&r=bis4VSCZavhQtfjFuCXVlC6MY8OhonJT1cJm04Er43o&m=raEcoUq946-7Z8PdhsP-o2P28CE2TWEV-mrlLScgezk&s=veITldAPOQm8Inp4w2UJqhAr-78ZWGYwVx55pCOd5FA&e=
Getting started: https://urldefense.proofpoint.com/v2/url?u=http-3A__www.clusterlabs.org_doc_Cluster-5Ffrom-5FScratch.pdf&d=BQICAg&c=DZ-EF4pZfxGSU6MfABwx0g&r=bis4VSCZavhQtfjFuCXVlC6MY8OhonJT1cJm04Er43o&m=raEcoUq946-7Z8PdhsP-o2P28CE2TWEV-mrlLScgezk&s=7qFi2U4ok3zqu5221aJ57gdG3O6YByLgxPiiul2Ljn4&e=
Bugs: https://urldefense.proofpoint.com/v2/url?u=http-3A__bugs.clusterlabs.org&d=BQICAg&c=DZ-EF4pZfxGSU6MfABwx0g&r=bis4VSCZavhQtfjFuCXVlC6MY8OhonJT1cJm04Er43o&m=raEcoUq946-7Z8PdhsP-o2P28CE2TWEV-mrlLScgezk&s=T6D0JDGaqyVXMLHSb0SK3kHz9tCUPpw4zVPOdqc9Tmw&e=