[Pacemaker] Sporadic problems of rejoin after split brain situation

Roman Sidler roman.sidler at sidler-se.net
Sun Mar 18 19:53:11 EDT 2012


Hi all
No, we have no crossover cable between the nodes.
The two nodes are linked by a switched network, and this works fine except in
the case mentioned below.

It's rather easy to reproduce:
1. Activate a 2-node cluster.
2. Disconnect the network connection (e.g. by disconnecting the network adapter on a VM).
3. Wait until both nodes are active and each acts as DC.
4. Reconnect the nodes.
5. A new DC is elected.

When step 4 happens over an unsteady network, sometimes step 5 is never
reached and both nodes stay DC, even though they are still sending and
receiving heartbeat status messages.

The situation can be simulated by repeatedly disconnecting and reconnecting the link.
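For reference, a rough way to script that flapping on a VM node could look like the
following (just a sketch; it assumes bash, that eth0 is the cluster link, and that
ifdown/ifup are available on the node):

# flap the cluster link with random up/down intervals (bash)
while true; do
    ifdown eth0                    # drop the cluster link
    sleep $(( RANDOM % 10 + 5 ))   # stay partitioned for a random 5-14 s
    ifup eth0                      # bring the link back
    sleep $(( RANDOM % 10 + 5 ))   # give the nodes time to start rejoining
done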

Versions:
pacemaker 1.1.6 (and 1.0.5)
heartbeat 3.0.7 (and 3.0.0)

Are there any ideas?
Or does anyone have hints on how I could localize or debug this problem?
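So far the only things I can think of trying are raising heartbeat's debug level and
pulling a cluster report, roughly like this (assuming the default /etc/ha.d/ha.cf
location and that hb_report and crm_mon are installed):

# raise heartbeat's verbosity: set "debug 1" in /etc/ha.d/ha.cf and restart heartbeat

# collect logs, CIB and PE inputs from both nodes around the incident
# (the exact timestamp format may need adjusting):
hb_report -f "2012-03-16 01:30" /tmp/splitbrain-report

# watch which node claims to be DC on each side while reconnecting:
crm_mon -1 | grep "Current DC"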

Thanks!
Kind regards
Roman



The log output is slightly different every time; here are two examples.

*** pacemaker 1.0.5 ***
Mar  8 07:05:50 LAB19 heartbeat: [2979]: CRIT: Cluster node lab13 returning 
after partition.
Mar  8 07:05:50 LAB19 heartbeat: [2979]: info: For information on cluster 
partitions, See URL: http://linux-ha.org/SplitBrain
Mar  8 07:05:50 LAB19 heartbeat: [2979]: WARN: Deadtime value may be too small.
Mar  8 07:05:50 LAB19 heartbeat: [2979]: info: See FAQ for information on tuning 
deadtime.
Mar  8 07:05:50 LAB19 heartbeat: [2979]: info: URL: http://linux-ha.org/FAQ#heavy_load
Mar  8 07:05:50 LAB19 heartbeat: [2979]: info: Link lab13:eth0 up.
Mar  8 07:05:50 LAB19 heartbeat: [2979]: WARN: Late heartbeat: Node lab13: 
interval 244130 ms
Mar  8 07:05:50 LAB19 heartbeat: [2979]: info: Status update for node lab13: 
status active
Mar  8 07:05:50 LAB19 crmd: [3083]: notice: crmd_ha_status_callback: Status 
update: Node lab13 now has status [active] (DC=true)
Mar  8 07:05:50 LAB19 crmd: [3083]: info: crm_update_peer_proc: lab13.ais is now 
online
Mar  8 07:05:50 LAB19 cib: [3079]: WARN: cib_peer_callback: Discarding 
cib_apply_diff message (14a33) from lab13: not in our membership
Mar  8 07:05:50 LAB19 heartbeat: [2979]: info: all clients are now paused
Mar  8 07:05:51 LAB19 ccm: [3078]: info: Break tie for 2 nodes cluster
Mar  8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm
Mar  8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: no mbr_track info
Mar  8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: Got an event 
OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar  8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: instance=15, 
nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Mar  8 07:05:51 LAB19 crmd: [3083]: info: crmd_ccm_msg_callback: Quorum 
(re)attained after event=NEW MEMBERSHIP (id=15)
Mar  8 07:05:51 LAB19 crmd: [3083]: info: ccm_event_detail: NEW MEMBERSHIP: 
trans=15, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
Mar  8 07:05:51 LAB19 crmd: [3083]: info: ccm_event_detail: CURRENT: lab19 
[nodeid=1, born=15]
Mar  8 07:05:51 LAB19 crmd: [3083]: info: populate_cib_nodes_ha: Requesting the 
list of configured nodes
Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm
Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: no mbr_track info
Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event 
OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: instance=15, nodes=1, 
new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Mar  8 07:05:51 LAB19 cib: [3079]: info: cib_ccm_msg_callback: Processing CCM 
event=NEW MEMBERSHIP (id=15)
Mar  8 07:05:51 LAB19 heartbeat: [2979]: info: all clients are now resumed
Mar  8 07:05:51 LAB19 ccm: [3078]: info: Break tie for 2 nodes cluster
Mar  8 07:05:51 LAB19 ccm: [3078]: info: Break tie for 2 nodes cluster
Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm
Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: no mbr_track info
Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event 
OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: instance=16, nodes=1, 
new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Mar  8 07:05:51 LAB19 cib: [3079]: info: cib_ccm_msg_callback: Processing CCM 
event=NEW MEMBERSHIP (id=16)
Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm
Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: no mbr_track info
Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event 
OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar  8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: instance=17, nodes=1, 
new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Mar  8 07:05:51 LAB19 cib: [3079]: info: cib_ccm_msg_callback: Processing CCM 
event=NEW MEMBERSHIP (id=17)
Mar  8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm
Mar  8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: no mbr_track info
Mar  8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: Got an event 
OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar  8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: instance=16, 
nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
:

*** pacemaker 1.1.6 ***
Mar 16 01:35:28 oan1 heartbeat: [7556]: CRIT: Cluster node oan3 returning after 
partition.
Mar 16 01:35:28 oan1 heartbeat: [7556]: info: For information on cluster 
partitions, See URL: http://linux-ha.org/wiki/Split_Brain
Mar 16 01:35:28 oan1 heartbeat: [7556]: WARN: Deadtime value may be too small.
Mar 16 01:35:28 oan1 heartbeat: [7556]: info: See FAQ for information on tuning 
deadtime.
Mar 16 01:35:28 oan1 heartbeat: [7556]: info: URL: http://linux-ha.org/wiki/FAQ#Heavy_Load
Mar 16 01:35:28 oan1 heartbeat: [7556]: info: Link oan3:eth0 up.
Mar 16 01:35:28 oan1 heartbeat: [7556]: WARN: Late heartbeat: Node oan3: 
interval 73300 ms
Mar 16 01:35:28 oan1 heartbeat: [7556]: info: Status update for node oan3: 
status active
Mar 16 01:35:28 oan1 crmd: [7601]: notice: crmd_ha_status_callback: Status 
update: Node oan3 now has status [active]
Mar 16 01:35:28 oan1 crmd: [7601]: info: crm_update_peer_proc: oan3.ais is now 
online
Mar 16 01:35:28 oan1 crmd: [7601]: WARN: match_down_event: No match for shutdown 
action on a9dae67b-b98f-498b-ad1b-950fc73c2ab2
Mar 16 01:35:28 oan1 crmd: [7601]: info: te_update_diff: Stonith/shutdown of 
a9dae67b-b98f-498b-ad1b-950fc73c2ab2 not matched
Mar 16 01:35:28 oan1 crmd: [7601]: info: abort_transition_graph: 
te_update_diff:215 - Triggered transition abort (complete=1, tag=node_state, 
id=a9dae67b-b98f-498b-ad1b-950fc73c2ab2, magic=NA, cib=0.210.50) : Node failure
Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: State transition 
S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: All 1 cluster 
nodes are eligible to run resources.
Mar 16 01:35:28 oan1 crmd: [7601]: info: do_pe_invoke: Query 204: Requesting the 
current CIB: S_POLICY_ENGINE
Mar 16 01:35:28 oan1 crmd: [7601]: info: do_pe_invoke_callback: Invoking the PE: 
query=204, ref=pe_calc-dc-1331858128-123, seq=13, quorate=1
Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave   
rsc_clustercontrol	(Started oan1)
Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave   rscIpaddr2	
(Started oan1)
Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave   rsc_asterisk	
(Started oan1)
Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave   rsc_tomcat5	
(Started oan1)
Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave   
rsc_ntp_refresh	(Started oan1)
Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: State transition 
S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE 
origin=handle_response ]
Mar 16 01:35:28 oan1 crmd: [7601]: info: unpack_graph: Unpacked transition 16: 0 
actions in 0 synapses
Mar 16 01:35:28 oan1 crmd: [7601]: info: do_te_invoke: Processing graph 16 
(ref=pe_calc-dc-1331858128-123) derived from /var/lib/pengine/pe-input-201.bz2
Mar 16 01:35:28 oan1 pengine: [17673]: notice: process_pe_message: Transition 
16: PEngine Input stored in: /var/lib/pengine/pe-input-201.bz2
Mar 16 01:35:28 oan1 crmd: [7601]: info: run_graph: 
====================================================
Mar 16 01:35:28 oan1 crmd: [7601]: notice: run_graph: Transition 16 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-201.bz2): Complete
Mar 16 01:35:28 oan1 crmd: [7601]: info: te_graph_trigger: Transition 16 is now 
complete
Mar 16 01:35:28 oan1 crmd: [7601]: info: notify_crmd: Transition 16 status: done 
- <null>
Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: State transition 
S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL 
origin=notify_crmd ]
Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: Starting PEngine 
Recheck Timer
Mar 16 01:35:30 oan1 cib: [7597]: info: cib_process_diff: Diff 0.210.56 -> 
0.210.57 not applied to 0.210.50: current "num_updates" is less than required
Mar 16 01:35:30 oan1 cib: [7597]: WARN: cib_server_process_diff: Not requesting 
full refresh in R/W mode
Mar 16 01:35:30 oan1 ccm: [7596]: info: Break tie for 2 nodes cluster
Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm
Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: no mbr_track info
Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: Got an event 
OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: instance=14, nodes=1, 
new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Mar 16 01:35:30 oan1 cib: [7597]: info: cib_ccm_msg_callback: Processing CCM 
event=NEW MEMBERSHIP (id=14)
Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm
Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: no mbr_track info
Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: Got an event 
OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: instance=14, nodes=1, 
new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Mar 16 01:35:30 oan1 crmd: [7601]: info: crmd_ccm_msg_callback: Quorum 
(re)attained after event=NEW MEMBERSHIP (id=14)
Mar 16 01:35:30 oan1 crmd: [7601]: info: ccm_event_detail: NEW MEMBERSHIP: 
trans=14, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
Mar 16 01:35:30 oan1 crmd: [7601]: info: ccm_event_detail: 	CURRENT: oan1 
[nodeid=0, born=14]
Mar 16 01:35:30 oan1 crmd: [7601]: info: populate_cib_nodes_ha: Requesting the 
list of configured nodes
Mar 16 01:35:31 oan1 ccm: [7596]: info: Break tie for 2 nodes cluster
Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm
Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: no mbr_track info
Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: Got an event 
OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: instance=15, nodes=1, 
new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm
Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: no mbr_track info
Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: Got an event 
OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: instance=15, nodes=1, 
new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Mar 16 01:35:31 oan1 cib: [7597]: info: cib_ccm_msg_callback: Processing CCM 
event=NEW MEMBERSHIP (id=15)
Mar 16 01:35:31 oan1 crmd: [7601]: info: crmd_ccm_msg_callback: Quorum 
(re)attained after event=NEW MEMBERSHIP (id=15)
Mar 16 01:35:31 oan1 crmd: [7601]: info: ccm_event_detail: NEW MEMBERSHIP: 
trans=15, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
Mar 16 01:35:31 oan1 crmd: [7601]: info: ccm_event_detail: 	CURRENT: oan1 
[nodeid=0, born=15]
Mar 16 01:35:31 oan1 cib: [7597]: info: cib_process_request: Operation complete: 
op cib_modify for section nodes (origin=local/crmd/205, version=0.210.51): ok 
(rc=0)
Mar 16 01:35:31 oan1 crmd: [7601]: info: populate_cib_nodes_ha: Requesting the 
list of configured nodes





