[Pacemaker] Sporadic problems of rejoin after split brain situation

Andrew Beekhof andrew at beekhof.net
Mon Mar 19 00:01:15 EDT 2012


On Mon, Mar 19, 2012 at 10:53 AM, Roman Sidler
<roman.sidler at sidler-se.net> wrote:
> Hi all
> No, we have no crossover cable between the nodes.
> The 2 nodes are linked by a switched network, and this works really fine
> except in the case mentioned.
>
> It's rather easy to reproduce.
> 1. Activate a 2-node cluster
> 2. disconnect the network connection (e.g. by disconnecting the VM's network adapter)
> 3. wait until both nodes are active and each acts as DC
> 4. reconnect nodes
> 5. a new DC is elected
>
> When step 4 encounters an unsteady network, sometimes step 5 is never
> reached and both nodes stay DC. They are still sending and receiving
> heartbeat status messages.
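Just for reproducing it, that sort of unsteady link is easy to script from
inside the guest; a rough sketch, assuming a bash root shell and eth1 as
the heartbeat interface (both are assumptions, adjust to your setup):

  # flap the heartbeat link a few times with short, irregular gaps
  for i in 1 2 3 4 5; do
      ip link set eth1 down
      sleep $(( (RANDOM % 5) + 1 ))    # stay down for 1-5 seconds
      ip link set eth1 up
      sleep $(( (RANDOM % 10) + 2 ))   # stay up for 2-11 seconds
  done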

The CCM (part of heartbeat) is known not to be particularly awesome in
such scenarios, and that appears to be the case again here:

Mar 16 01:35:31 oan1 crmd: [7601]: info: ccm_event_detail: NEW MEMBERSHIP:
trans=15, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3

As far as pacemaker knows, it's still all alone.
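If you want to check that from the shell, running something like the
following on each node should show what membership each side currently
believes in (assuming the stock pacemaker command line tools are
installed):

  # run on both nodes; after a clean rejoin both should list both nodes
  crm_node -p    # node names in this node's current partition
  crm_mon -1     # one-shot cluster status, including which node is DC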

Perhaps give corosync 1.4.x a go instead of heartbeat?
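If you do switch, a minimal corosync 1.4-style config for a two-node
cluster looks roughly like the sketch below; the addresses are
placeholders for your own network, and the pacemaker service stanza
assumes you start pacemakerd separately (ver: 1):

  # /etc/corosync/corosync.conf (sketch only; adjust to your network)
  totem {
      version: 2
      secauth: off
      interface {
          ringnumber: 0
          bindnetaddr: 192.168.1.0    # your cluster subnet
          mcastaddr: 226.94.1.1
          mcastport: 5405
      }
  }
  logging {
      to_syslog: yes
  }

  # /etc/corosync/service.d/pcmk (or the same stanza in corosync.conf)
  service {
      name: pacemaker
      ver: 1                          # start the pacemakerd daemon yourself
  }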

>
> The situation can be simulated by a few repeated disconnects/reconnects
>
> Versions:
> pacemaker 1.1.6 (and 1.0.5)
> heartbeat 3.0.7 (and 3.0.0)
>
> Are there any ideas?
> Or does anybody have hints on how I could localize or debug this problem?
>
> Thanks!
> Kind regards
> Roman
>
>
>
> The log output is a little different every time; here are 2 examples:
>
> *** pacemaker 1.0.5 ***
> Mar  8 07:05:50 LAB19 heartbeat: [2979]: CRIT: Cluster node lab13 returning
> after partition.
> Mar  8 07:05:50 LAB19 heartbeat: [2979]: info: For information on cluster
> partitions, See URL: http://linux-ha.org/SplitBrain
> Mar  8 07:05:50 LAB19 heartbeat: [2979]: WARN: Deadtime value may be too small.
> Mar  8 07:05:50 LAB19 heartbeat: [2979]: info: See FAQ for information on tuning
> deadtime.
> Mar  8 07:05:50 LAB19 heartbeat: [2979]: info: URL: http://linux-
> ha.org/FAQ#heavy_load
> Mar  8 07:05:50 LAB19 heartbeat: [2979]: info: Link lab13:eth0 up.
> Mar  8 07:05:50 LAB19 heartbeat: [2979]: WARN: Late heartbeat: Node lab13:
> interval 244130 ms
> Mar  8 07:05:50 LAB19 heartbeat: [2979]: info: Status update for node lab13:
> status active
> Mar  8 07:05:50 LAB19 crmd: [3083]: notice: crmd_ha_status_callback: Status
> update: Node lab13 now has status [active] (DC=true)
> Mar  8 07:05:50 LAB19 crmd: [3083]: info: crm_update_peer_proc: lab13.ais is now
> online
> Mar  8 07:05:50 LAB19 cib: [3079]: WARN: cib_peer_callback: Discarding
> cib_apply_diff message (14a33) from lab13: not in our membership
> Mar  8 07:05:50 LAB19 heartbeat: [2979]: info: all clients are now paused
> Mar  8 07:05:51 LAB19 ccm: [3078]: info: Break tie for 2 nodes cluster
> Mar  8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: Got an event
> OC_EV_MS_INVALID from ccm
> Mar  8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: no mbr_track info
> Mar  8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: Got an event
> OC_EV_MS_NEW_MEMBERSHIP from ccm
> Mar  8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: instance=15,
> nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> Mar  8 07:05:51 LAB19 crmd: [3083]: info: crmd_ccm_msg_callback: Quorum
> (re)attained after event=NEW MEMBERSHIP (id=15)
> Mar  8 07:05:51 LAB19 crmd: [3083]: info: ccm_event_detail: NEW MEMBERSHIP:
> trans=15, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
> Mar  8 07:05:51 LAB19 crmd: [3083]: info: ccm_event_detail: CURRENT: lab19
> [nodeid=1, born=15]
> Mar  8 07:05:51 LAB19 crmd: [3083]: info: populate_cib_nodes_ha: Requesting the
> list of configured nodes
> Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event
> OC_EV_MS_INVALID from ccm
> Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: no mbr_track info
> Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event
> OC_EV_MS_NEW_MEMBERSHIP from ccm
> Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: instance=15, nodes=1,
> new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> Mar  8 07:05:51 LAB19 cib: [3079]: info: cib_ccm_msg_callback: Processing CCM
> event=NEW MEMBERSHIP (id=15)
> Mar  8 07:05:51 LAB19 heartbeat: [2979]: info: all clients are now resumed
> Mar  8 07:05:51 LAB19 ccm: [3078]: info: Break tie for 2 nodes cluster
> Mar  8 07:05:51 LAB19 ccm: [3078]: info: Break tie for 2 nodes cluster
> Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event
> OC_EV_MS_INVALID from ccm
> Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: no mbr_track info
> Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event
> OC_EV_MS_NEW_MEMBERSHIP from ccm
> Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: instance=16, nodes=1,
> new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> Mar  8 07:05:51 LAB19 cib: [3079]: info: cib_ccm_msg_callback: Processing CCM
> event=NEW MEMBERSHIP (id=16)
> Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event
> OC_EV_MS_INVALID from ccm
> Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: no mbr_track info
> Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event
> OC_EV_MS_NEW_MEMBERSHIP from ccm
> Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: instance=17, nodes=1,
> new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> Mar  8 07:05:51 LAB19 cib: [3079]: info: cib_ccm_msg_callback: Processing CCM
> event=NEW MEMBERSHIP (id=17)
> Mar  8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: Got an event
> OC_EV_MS_INVALID from ccm
> Mar  8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: no mbr_track info
> Mar  8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: Got an event
> OC_EV_MS_NEW_MEMBERSHIP from ccm
> Mar  8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: instance=16,
> nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> :
>
> *** pacemaker 1.1.6 ***
> Mar 16 01:35:28 oan1 heartbeat: [7556]: CRIT: Cluster node oan3 returning after
> partition.
> Mar 16 01:35:28 oan1 heartbeat: [7556]: info: For information on cluster
> partitions, See URL: http://linux-ha.org/wiki/Split_Brain
> Mar 16 01:35:28 oan1 heartbeat: [7556]: WARN: Deadtime value may be too small.
> Mar 16 01:35:28 oan1 heartbeat: [7556]: info: See FAQ for information on tuning
> deadtime.
> Mar 16 01:35:28 oan1 heartbeat: [7556]: info: URL: http://linux-
> ha.org/wiki/FAQ#Heavy_Load
> Mar 16 01:35:28 oan1 heartbeat: [7556]: info: Link oan3:eth0 up.
> Mar 16 01:35:28 oan1 heartbeat: [7556]: WARN: Late heartbeat: Node oan3:
> interval 73300 ms
> Mar 16 01:35:28 oan1 heartbeat: [7556]: info: Status update for node oan3:
> status active
> Mar 16 01:35:28 oan1 crmd: [7601]: notice: crmd_ha_status_callback: Status
> update: Node oan3 now has status [active]
> Mar 16 01:35:28 oan1 crmd: [7601]: info: crm_update_peer_proc: oan3.ais is now
> online
> Mar 16 01:35:28 oan1 crmd: [7601]: WARN: match_down_event: No match for shutdown
> action on a9dae67b-b98f-498b-ad1b-950fc73c2ab2
> Mar 16 01:35:28 oan1 crmd: [7601]: info: te_update_diff: Stonith/shutdown of
> a9dae67b-b98f-498b-ad1b-950fc73c2ab2 not matched
> Mar 16 01:35:28 oan1 crmd: [7601]: info: abort_transition_graph:
> te_update_diff:215 - Triggered transition abort (complete=1, tag=node_state,
> id=a9dae67b-b98f-498b-ad1b-950fc73c2ab2, magic=NA, cib=0.210.50) : Node failure
> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: State transition
> S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
> origin=abort_transition_graph ]
> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: All 1 cluster
> nodes are eligible to run resources.
> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_pe_invoke: Query 204: Requesting the
> current CIB: S_POLICY_ENGINE
> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_pe_invoke_callback: Invoking the PE:
> query=204, ref=pe_calc-dc-1331858128-123, seq=13, quorate=1
> Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave
> rsc_clustercontrol      (Started oan1)
> Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave   rscIpaddr2
> (Started oan1)
> Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave   rsc_asterisk
> (Started oan1)
> Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave   rsc_tomcat5
> (Started oan1)
> Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave
> rsc_ntp_refresh (Started oan1)
> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: State transition
> S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> origin=handle_response ]
> Mar 16 01:35:28 oan1 crmd: [7601]: info: unpack_graph: Unpacked transition 16: 0
> actions in 0 synapses
> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_te_invoke: Processing graph 16
> (ref=pe_calc-dc-1331858128-123) derived from /var/lib/pengine/pe-input-201.bz2
> Mar 16 01:35:28 oan1 pengine: [17673]: notice: process_pe_message: Transition
> 16: PEngine Input stored in: /var/lib/pengine/pe-input-201.bz2
> Mar 16 01:35:28 oan1 crmd: [7601]: info: run_graph:
> ====================================================
> Mar 16 01:35:28 oan1 crmd: [7601]: notice: run_graph: Transition 16 (Complete=0,
> Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-
> 201.bz2): Complete
> Mar 16 01:35:28 oan1 crmd: [7601]: info: te_graph_trigger: Transition 16 is now
> complete
> Mar 16 01:35:28 oan1 crmd: [7601]: info: notify_crmd: Transition 16 status: done
> - <null>
> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: State transition
> S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL
> origin=notify_crmd ]
> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: Starting PEngine
> Recheck Timer
> Mar 16 01:35:30 oan1 cib: [7597]: info: cib_process_diff: Diff 0.210.56 ->
> 0.210.57 not applied to 0.210.50: current "num_updates" is less than required
> Mar 16 01:35:30 oan1 cib: [7597]: WARN: cib_server_process_diff: Not requesting
> full refresh in R/W mode
> Mar 16 01:35:30 oan1 ccm: [7596]: info: Break tie for 2 nodes cluster
> Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: Got an event
> OC_EV_MS_INVALID from ccm
> Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: no mbr_track info
> Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: Got an event
> OC_EV_MS_NEW_MEMBERSHIP from ccm
> Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: instance=14, nodes=1,
> new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> Mar 16 01:35:30 oan1 cib: [7597]: info: cib_ccm_msg_callback: Processing CCM
> event=NEW MEMBERSHIP (id=14)
> Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: Got an event
> OC_EV_MS_INVALID from ccm
> Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: no mbr_track info
> Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: Got an event
> OC_EV_MS_NEW_MEMBERSHIP from ccm
> Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: instance=14, nodes=1,
> new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> Mar 16 01:35:30 oan1 crmd: [7601]: info: crmd_ccm_msg_callback: Quorum
> (re)attained after event=NEW MEMBERSHIP (id=14)
> Mar 16 01:35:30 oan1 crmd: [7601]: info: ccm_event_detail: NEW MEMBERSHIP:
> trans=14, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
> Mar 16 01:35:30 oan1 crmd: [7601]: info: ccm_event_detail:      CURRENT: oan1
> [nodeid=0, born=14]
> Mar 16 01:35:30 oan1 crmd: [7601]: info: populate_cib_nodes_ha: Requesting the
> list of configured nodes
> Mar 16 01:35:31 oan1 ccm: [7596]: info: Break tie for 2 nodes cluster
> Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: Got an event
> OC_EV_MS_INVALID from ccm
> Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: no mbr_track info
> Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: Got an event
> OC_EV_MS_NEW_MEMBERSHIP from ccm
> Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: instance=15, nodes=1,
> new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: Got an event
> OC_EV_MS_INVALID from ccm
> Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: no mbr_track info
> Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: Got an event
> OC_EV_MS_NEW_MEMBERSHIP from ccm
> Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: instance=15, nodes=1,
> new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> Mar 16 01:35:31 oan1 cib: [7597]: info: cib_ccm_msg_callback: Processing CCM
> event=NEW MEMBERSHIP (id=15)
> Mar 16 01:35:31 oan1 crmd: [7601]: info: crmd_ccm_msg_callback: Quorum
> (re)attained after event=NEW MEMBERSHIP (id=15)
> Mar 16 01:35:31 oan1 crmd: [7601]: info: ccm_event_detail: NEW MEMBERSHIP:
> trans=15, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
> Mar 16 01:35:31 oan1 crmd: [7601]: info: ccm_event_detail:      CURRENT: oan1
> [nodeid=0, born=15]
> Mar 16 01:35:31 oan1 cib: [7597]: info: cib_process_request: Operation complete:
> op cib_modify for section nodes (origin=local/crmd/205, version=0.210.51): ok
> (rc=0)
> Mar 16 01:35:31 oan1 crmd: [7601]: info: populate_cib_nodes_ha: Requesting the
> list of configured nodes
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



