[ClusterLabs] Network split during corosync startup results in split brain

Wed Jul 29 09:02:39 CEST 2015

Thomas Meagher napsal(a):
>
> We have written an automated test that was able to reproduce this issue, with corosync debug logging.
> I once again have two nodes with a split network and both nodes are running even though no fencing was ever initiated.
>
> 1.  Bring both nodes up.
> 2.  'echo b > /proc/sysrq-trigger' to one node (hard reboot)
> 3.  Wait for node to come back up.
> 4.  When 'Quorum' first appears in journalctl, wait some period of time between 0.5 - 1.5 seconds, then drop network via 'ifdown ethX'

Don't do ifdown. If you want to simulate node disconnect, block traffic 
with iptables, on the switch or unplug cable. During ifdown corosync 
will rebind to localhost (this is long time known bug) and behaves weirdly.

Regards,
   Honza

>
>  From what I see in the logs below, initially the network fault has been detected and Totem says it is unable to form a cluster.
> But then I see that crmd "Could not recover from internal error"
> A bunch of interrupted system calls, and what looks like a restart of crmd.
> Then we actually "Quorum acquired" (right in the middle of all this bad stuff going on!) and "The local CRM is operational"
> Then more terminate and recovery operations along with "Fast-tracking shutdown in response to errors", but the reality is that corosync is STILL running on this node.
> I even have the system in this state currently.  To me, it appears corosync was forked or restarted and there is some kind of race here.
>
> Relevant logs:
>
> Jul 27 09:07:19 d24-22-right.lab.archivas.com corosync[4629]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: feature_update_callback: Update failed: Timer expired (-62)
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:    error: do_log: FSA: Input I_ERROR from feature_update_callback() received in state S_INTEGRATION
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:  warning: do_state_transition: State transition S_INTEGRATION -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=feature_update_callback ]
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:  warning: do_recover: Fast-tracking shutdown in response to errors
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:  warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:    error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: lrm_state_verify_stopped: Stopped 0 recurring operations at shutdown (10 ops remaining)
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: lrm_state_verify_stopped: Recurring action tomcat-instance:695 (tomcat-instance_monitor_10000) incomplete at shutdown
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: lrm_state_verify_stopped: Recurring action postgres:723 (postgres_monitor_5000) incomplete at shutdown
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: lrm_state_verify_stopped: Recurring action fence_sbd:56 (fence_sbd_monitor_10000) incomplete at shutdown
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: lrm_state_verify_stopped: Recurring action MasterIP:724 (MasterIP_monitor_10000) incomplete at shutdown
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: lrm_state_verify_stopped: Recurring action SlaveIP:58 (SlaveIP_monitor_10000) incomplete at shutdown
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: lrm_state_verify_stopped: Recurring action ethmonitor:60 (ethmonitor_monitor_10000) incomplete at shutdown
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: lrm_state_verify_stopped: Recurring action N1F1:718 (N1F1_monitor_10000) incomplete at shutdown
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: lrm_state_verify_stopped: Recurring action N2F1:63 (N2F1_monitor_10000) incomplete at shutdown
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: lrm_state_verify_stopped: Recurring action ipmi-22:57 (ipmi-22_monitor_10000) incomplete at shutdown
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: lrm_state_verify_stopped: Recurring action ClusterMonitor:700 (ClusterMonitor_monitor_10000) incomplete at shutdown
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:    error: lrm_state_verify_stopped: 10 resources were active at shutdown.
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: do_lrm_control: Disconnected from the LRM
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:   notice: terminate_cs_connection: Disconnecting from Corosync
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[4702]:    error: crmd_fast_exit: Could not recover from internal error
> Jul 27 09:07:20 d24-22-right.lab.archivas.com cib[4697]:  warning: qb_ipcs_event_sendv: new_event_notification (4697-4702-11): Broken pipe (32)
> Jul 27 09:07:20 d24-22-right.lab.archivas.com cib[4697]:  warning: do_local_notify: A-Sync reply to crmd failed: No message of desired type
> Jul 27 09:07:20 d24-22-right.lab.archivas.com pacemakerd[4671]:    error: pcmk_child_exit: The crmd process (4702) exited: Generic Pacemaker error (201)
> Jul 27 09:07:20 d24-22-right.lab.archivas.com pacemakerd[4671]:   notice: pcmk_process_exit: Respawning failed child process: crmd
> Jul 27 09:07:20 d24-22-right.lab.archivas.com pacemakerd[4671]: child_waitpid: wait(4697) = 0: Interrupted system call (4)
> Jul 27 09:07:20 d24-22-right.lab.archivas.com pacemakerd[4671]: child_waitpid: wait(4698) = 0: Interrupted system call (4)
> Jul 27 09:07:20 d24-22-right.lab.archivas.com pacemakerd[4671]: child_waitpid: wait(4699) = 0: Interrupted system call (4)
> Jul 27 09:07:20 d24-22-right.lab.archivas.com pacemakerd[4671]: child_waitpid: wait(4700) = 0: Interrupted system call (4)
> Jul 27 09:07:20 d24-22-right.lab.archivas.com pacemakerd[4671]: child_waitpid: wait(4701) = 0: Interrupted system call (4)
> Jul 27 09:07:20 d24-22-right.lab.archivas.com pacemakerd[4671]: child_waitpid: wait(7224) = 0: Success (0)
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[7224]:   notice: main: CRM Git Version: a9c8177
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[7224]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[7224]:   notice: get_node_name: Could not obtain a node name for corosync nodeid 2
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[7224]:   notice: get_node_name: Defaulting to uname -n for the local corosync node name
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[7224]:   notice: cluster_connect_quorum: Quorum acquired
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[7224]:   notice: config_query_callback: Watchdog enabled but stonith-watchdog-timeout is disabled
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[7224]:   notice: get_node_name: Could not obtain a node name for corosync nodeid 1
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[7224]:   notice: get_node_name: Could not obtain a node name for corosync nodeid 1
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[7224]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[1] - state is now member (was (null))
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[7224]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node d24-22-right.lab.archivas.com[2] - state is now member (was (null))
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[7224]:   notice: get_node_name: Defaulting to uname -n for the local corosync node name
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[7224]:   notice: do_started: The local CRM is operational
> Jul 27 09:07:20 d24-22-right.lab.archivas.com crmd[7224]:   notice: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
> Jul 27 09:07:21 d24-22-right.lab.archivas.com corosync[4629]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local
> Jul 27 09:07:22 d24-22-right.lab.archivas.com corosync[4629]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local
> Jul 27 09:07:24 d24-22-right.lab.archivas.com corosync[4629]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local
>
> Jul 27 09:07:30 d24-22-right.lab.archivas.com crmd[7224]:    error: node_list_update_callback: Node update 3 failed: Timer expired (-62)
> Jul 27 09:07:30 d24-22-right.lab.archivas.com crmd[7224]:    error: do_log: FSA: Input I_ERROR from node_list_update_callback() received in state S_PENDING
> Jul 27 09:07:30 d24-22-right.lab.archivas.com crmd[7224]:  warning: do_state_transition: State transition S_PENDING -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=node_list_update_callback ]
> Jul 27 09:07:30 d24-22-right.lab.archivas.com crmd[7224]:  warning: do_recover: Fast-tracking shutdown in response to errors
> Jul 27 09:07:30 d24-22-right.lab.archivas.com crmd[7224]:    error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
> Jul 27 09:07:30 d24-22-right.lab.archivas.com crmd[7224]:   notice: lrm_state_verify_stopped: Stopped 0 recurring operations at shutdown (0 ops remaining)
> Jul 27 09:07:30 d24-22-right.lab.archivas.com crmd[7224]:   notice: do_lrm_control: Disconnected from the LRM
> Jul 27 09:07:30 d24-22-right.lab.archivas.com crmd[7224]:   notice: terminate_cs_connection: Disconnecting from Corosync
>
> Despite the fact that the last log message here says "Disconnecting from Corosync", everything is up and running and the other node is marked as pending.
>
> Node d24-22-left.lab.archivas.com (1): pending
> Online: [ d24-22-right.lab.archivas.com ]
>
>   fence_sbd	(stonith:fence_sbd):	Started d24-22-right.lab.archivas.com
>   ipmi-22	(stonith:fence_ipmilan):	Started d24-22-right.lab.archivas.com
>   Resource Group: DBMaster
>       MasterIP	(ocf::heartbeat:IPaddr2-rhino):	Started d24-22-right.lab.archivas.com
>   Resource Group: DBSlave
>       SlaveIP	(ocf::heartbeat:IPaddr2-rhino):	Started d24-22-right.lab.archivas.com
>   Master/Slave Set: postgres-ms [postgres]
>       Masters: [ d24-22-right.lab.archivas.com ]
>       Stopped: [ d24-22-left.lab.archivas.com ]
>   Clone Set: ethmonitor-clone [ethmonitor]
>       Started: [ d24-22-right.lab.archivas.com ]
>       Stopped: [ d24-22-left.lab.archivas.com ]
>   Clone Set: tomcat-clone [tomcat-group]
>       Started: [ d24-22-right.lab.archivas.com ]
>       Stopped: [ d24-22-left.lab.archivas.com ]
>   Resource Group: Node1-IPs
>       N1F1	(ocf::heartbeat:IPaddr2-rhino):	Started d24-22-right.lab.archivas.com
>   Resource Group: Node2-IPs
>       N2F1	(ocf::heartbeat:IPaddr2-rhino):	Started d24-22-right.lab.archivas.com
>
>
>
>
> ________________________________________
> From: Christine Caulfield [ccaulfie at redhat.com]
> Sent: Monday, July 20, 2015 11:36
> To: users at clusterlabs.org
> Subject: Re: [ClusterLabs] Network split during corosync startup results in split brain
>
> That's very interesting, and worrying.
>
> Can you send me the full logs please (just the corosync ones if they're
> separated, I don't think pacemaker is involved here). If you still have
> one node in that state (or can reproduce it) then the output of
> corosync-cmapctl on both nodes would also he helpful.
>
> Chrissie
>
> On 20/07/15 16:29, Thomas Meagher wrote:
>> Hello,
>>
>> Our team has been using corosync + pacemaker successfully for the last
>> year or two, but last week ran into an issue which I wanted to get some
>> more insight on.  We have a 2 node cluster, using the WaitForAll
>> votequorum parameter so all nodes must have been seen at least once
>> before resources are started.  We have two layers of fencing configured,
>> IPMI and SBD (storage based death, using shared storage).  We have done
>> extensive testing on our fencing in the past and it works great, but
>> here the fencing never got called.  One of our QA testers managed to
>> pull the network cable at a very particular time during startup, and it
>> seems to have resulted in corosync telling pacemaker that all nodes had
>> been seen, and that the cluster was in a normal state with one node up.
>> No fencing was ever triggered, and all resources were started normally.
>> The other node was NOT marked unclean.  This resulted in a split brain
>> scenario, as our master database (pgsql replication) was still running
>> as master on the other node, and had now been started and promoted on
>> this node.  Luckily this is all in a test environment, so no production
>> impact was seen.  Below is test specifics and some relevant logs.
>>
>> Procedure:
>> 1. Allow both nodes to come up fully.
>> 2. Reboot current master node.
>> 3. As node is booting up again (during corosync startup), pull
>> interconnect cable.
>>
>>
>> Expected Behavior:
>> 1. Node either a) fails to start any resources or b) fences other node
>> and promotes to master
>>
>>
>> Actual behavior:
>> 1. Node promotes to master without fencing peer, resulting in both nodes
>> running master database.
>>
>>
>> Module-2 is rebooted @ 12:57:42, and comes back up ~12:59.
>> When corosync starts up, both nodes are visible and all vote counts are
>> normal.
>>
>> Jul 15 12:59:00 module-2 corosync[2906]: [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
>> Jul 15 12:59:00 module-2 corosync[2906]: [TOTEM ] A new membership (10.1.1.2:56) was formed. Members joined: 2
>> Jul 15 12:59:00 module-2 corosync[2906]: [QUORUM] Waiting for all cluster members. Current votes: 1 expected_votes: 2
>> Jul 15 12:59:00 module-2 corosync[2906]: [QUORUM] Members[1]: 2
>> Jul 15 12:59:00 module-2 corosync[2906]: [MAIN  ] Completed service synchronization, ready to provide service.
>> Jul 15 12:59:06 module-2 pacemakerd[4076]: notice: cluster_connect_quorum: Quorum acquired
>>
>> 3 seconds later, the interconnect network cable is pulled.
>>
>> Jul 15 12:59:09 module-2 kernel: e1000e: eth3 NIC Link is Down
>>
>> Corosync recognizes this immediately, and declares the peer as dead.
>>
>> Jul 15 12:59:10 module-2 crmd[4107]: notice: peer_update_callback: Our peer on the DC (module-1) is dead
>>
>> Slightly later (very close), corosync initialization completes, it says
>> it has quorum, and declares system ready for use.
>>
>> Jul 15 12:59:10 module-2 corosync[2906]: [QUORUM] Members[1]: 2
>> Jul 15 12:59:10 module-2 corosync[2906]: [MAIN  ] Completed service synchronization, ready to provide service.
>>
>> Pacemaker starts resources normally, including Postgres.
>>
>> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start   fence_sbd        (module-2)
>> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start   ipmi-1        (module-2)
>> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start   SlaveIP        (module-2)
>> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start   postgres:0        (module-2)
>> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start   ethmonitor:0        (module-2)
>> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start   tomcat-instance:0        (module-2 - blocked)
>> Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start   ClusterMonitor:0        (module-2 - blocked)
>>
>> Votequorum shows 1 vote per node, WaitForAll is set. Pacemaker should
>> not be able to start ANY resources until it has seen all nodes once.
>>
>> module-2 ~ # corosync-quorumtool
>> Quorum information
>> ------------------
>> Date:             Wed Jul 15 18:15:34 2015
>> Quorum provider:  corosync_votequorum
>> Nodes:            1
>> Node ID:          2
>> Ring ID:          64
>> Quorate:          Yes
>>
>> Votequorum information
>> ----------------------
>> Expected votes:   2
>> Highest expected: 2
>> Total votes:      1
>> Quorum:           1
>> Flags:            2Node Quorate WaitForAll
>>
>> Membership information
>> ----------------------
>>      Nodeid      Votes Name
>>           2          1 module-2 (local)
>>
>>
>> Package versions:
>>
>> -bash-4.3# rpm -qa | grep corosync
>> corosynclib-2.3.4-1.fc22.x86_64
>> corosync-2.3.4-1.fc22.x86_64
>>
>> -bash-4.3# rpm -qa | grep pacemaker
>> pacemaker-cluster-libs-1.1.12-2.fc22.x86_64
>> pacemaker-libs-1.1.12-2.fc22.x86_64
>> pacemaker-cli-1.1.12-2.fc22.x86_64
>> pacemaker-1.1.12-2.fc22.x86_64
>>
>>
>>
>>
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__clusterlabs.org_mailman_listinfo_users&d=BQICAg&c=DZ-EF4pZfxGSU6MfABwx0g&r=bis4VSCZavhQtfjFuCXVlC6MY8OhonJT1cJm04Er43o&m=raEcoUq946-7Z8PdhsP-o2P28CE2TWEV-mrlLScgezk&s=K55TadxUiZxoM-E_nnrhwefXHKXaIfHh426JWg_PYA8&e=
>>
>> Project Home: https://urldefense.proofpoint.com/v2/url?u=http-3A__www.clusterlabs.org&d=BQICAg&c=DZ-EF4pZfxGSU6MfABwx0g&r=bis4VSCZavhQtfjFuCXVlC6MY8OhonJT1cJm04Er43o&m=raEcoUq946-7Z8PdhsP-o2P28CE2TWEV-mrlLScgezk&s=veITldAPOQm8Inp4w2UJqhAr-78ZWGYwVx55pCOd5FA&e=
>> Getting started: https://urldefense.proofpoint.com/v2/url?u=http-3A__www.clusterlabs.org_doc_Cluster-5Ffrom-5FScratch.pdf&d=BQICAg&c=DZ-EF4pZfxGSU6MfABwx0g&r=bis4VSCZavhQtfjFuCXVlC6MY8OhonJT1cJm04Er43o&m=raEcoUq946-7Z8PdhsP-o2P28CE2TWEV-mrlLScgezk&s=7qFi2U4ok3zqu5221aJ57gdG3O6YByLgxPiiul2Ljn4&e=
>> Bugs: https://urldefense.proofpoint.com/v2/url?u=http-3A__bugs.clusterlabs.org&d=BQICAg&c=DZ-EF4pZfxGSU6MfABwx0g&r=bis4VSCZavhQtfjFuCXVlC6MY8OhonJT1cJm04Er43o&m=raEcoUq946-7Z8PdhsP-o2P28CE2TWEV-mrlLScgezk&s=T6D0JDGaqyVXMLHSb0SK3kHz9tCUPpw4zVPOdqc9Tmw&e=
>>
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://urldefense.proofpoint.com/v2/url?u=http-3A__clusterlabs.org_mailman_listinfo_users&d=BQICAg&c=DZ-EF4pZfxGSU6MfABwx0g&r=bis4VSCZavhQtfjFuCXVlC6MY8OhonJT1cJm04Er43o&m=raEcoUq946-7Z8PdhsP-o2P28CE2TWEV-mrlLScgezk&s=K55TadxUiZxoM-E_nnrhwefXHKXaIfHh426JWg_PYA8&e=
>
> Project Home: https://urldefense.proofpoint.com/v2/url?u=http-3A__www.clusterlabs.org&d=BQICAg&c=DZ-EF4pZfxGSU6MfABwx0g&r=bis4VSCZavhQtfjFuCXVlC6MY8OhonJT1cJm04Er43o&m=raEcoUq946-7Z8PdhsP-o2P28CE2TWEV-mrlLScgezk&s=veITldAPOQm8Inp4w2UJqhAr-78ZWGYwVx55pCOd5FA&e=
> Getting started: https://urldefense.proofpoint.com/v2/url?u=http-3A__www.clusterlabs.org_doc_Cluster-5Ffrom-5FScratch.pdf&d=BQICAg&c=DZ-EF4pZfxGSU6MfABwx0g&r=bis4VSCZavhQtfjFuCXVlC6MY8OhonJT1cJm04Er43o&m=raEcoUq946-7Z8PdhsP-o2P28CE2TWEV-mrlLScgezk&s=7qFi2U4ok3zqu5221aJ57gdG3O6YByLgxPiiul2Ljn4&e=
> Bugs: https://urldefense.proofpoint.com/v2/url?u=http-3A__bugs.clusterlabs.org&d=BQICAg&c=DZ-EF4pZfxGSU6MfABwx0g&r=bis4VSCZavhQtfjFuCXVlC6MY8OhonJT1cJm04Er43o&m=raEcoUq946-7Z8PdhsP-o2P28CE2TWEV-mrlLScgezk&s=T6D0JDGaqyVXMLHSb0SK3kHz9tCUPpw4zVPOdqc9Tmw&e=
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>