[Pacemaker] Could not connect to the CIB: Remote node did not respond

Liang.Ma at asc-csa.gc.ca
Thu Feb 10 16:55:45 EST 2011


Hi,

Now I have taken one node offline with /etc/init.d/heartbeat stop.

With the remaining node, arsvr1, online, heartbeat tries to respawn crmd, but crmd exits with return code 2.
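(A quick way to spot the respawn loop is to grep the heartbeat log; /var/log/ha.log is an assumption on my part, the actual path depends on what ha.cf sets:)

    grep -E 'Respawning client|return code' /var/log/ha.log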

Here are the logs:

Feb 10 16:37:10 arsvr1 crmd: [5251]: info: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
Feb 10 16:38:11 arsvr1 crmd: [5251]: info: crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped!
Feb 10 16:38:11 arsvr1 crmd: [5251]: WARN: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
Feb 10 16:38:11 arsvr1 crmd: [5251]: info: do_state_transition: State transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped ]
Feb 10 16:38:11 arsvr1 crmd: [5251]: info: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
Feb 10 16:38:11 arsvr1 crmd: [5251]: info: do_te_control: Registering TE UUID: c173d324-3b4f-445b-850f-f3406cc116ac
Feb 10 16:38:11 arsvr1 crmd: [5251]: WARN: cib_client_add_notify_callback: Callback already present
Feb 10 16:38:11 arsvr1 crmd: [5251]: info: set_graph_functions: Setting custom graph functions
Feb 10 16:38:11 arsvr1 crmd: [5251]: info: unpack_graph: Unpacked transition -1: 0 actions in 0 synapses
Feb 10 16:38:11 arsvr1 crmd: [5251]: info: start_subsystem: Starting sub-system "pengine"
Feb 10 16:38:11 arsvr1 pengine: [5253]: info: Invoked: /usr/lib/heartbeat/pengine 
Feb 10 16:38:11 arsvr1 pengine: [5253]: info: main: Starting pengine
Feb 10 16:38:14 arsvr1 crmd: [5251]: info: do_dc_takeover: Taking over DC status for this partition
Feb 10 16:38:14 arsvr1 cib: [5116]: info: cib_process_readwrite: We are now in R/W mode
Feb 10 16:38:14 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_master for section 'all' (origin=local/crmd/6, version=0.298.3): ok (rc=0)
Feb 10 16:38:14 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/7, version=0.298.3): ok (rc=0)
Feb 10 16:38:14 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/9, version=0.298.3): ok (rc=0)
Feb 10 16:38:14 arsvr1 crmd: [5251]: info: join_make_offer: Making join offers based on membership 1
Feb 10 16:38:14 arsvr1 crmd: [5251]: info: do_dc_join_offer_all: join-1: Waiting on 1 outstanding join acks
Feb 10 16:38:14 arsvr1 crmd: [5251]: info: te_connect_stonith: Attempting connection to fencing daemon...
Feb 10 16:38:14 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/11, version=0.298.3): ok (rc=0)
Feb 10 16:38:15 arsvr1 crmd: [5251]: info: te_connect_stonith: Connected
Feb 10 16:38:15 arsvr1 crmd: [5251]: info: config_query_callback: Checking for expired actions every 900000ms
Feb 10 16:38:15 arsvr1 crmd: [5251]: info: update_dc: Set DC to arsvr1 (3.0.1)
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: do_state_transition: All 1 cluster nodes responded to the join offer.
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: do_dc_join_finalize: join-1: Syncing the CIB from arsvr1 to the rest of the cluster
Feb 10 16:38:16 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=local/crmd/14, version=0.298.3): ok (rc=0)
Feb 10 16:38:16 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/15, version=0.298.3): ok (rc=0)
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: update_attrd: Connecting to attrd...
Feb 10 16:38:16 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='arsvr1']/transient_attributes (origin=local/crmd/16, version=0.298.3): ok (rc=0)
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: erase_xpath_callback: Deletion of "//node_state[@uname='arsvr1']/transient_attributes": ok (rc=0)
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: do_dc_join_ack: join-1: Updating node state to member for arsvr1
Feb 10 16:38:16 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='arsvr1']/lrm (origin=local/crmd/17, version=0.298.4): ok (rc=0)
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: erase_xpath_callback: Deletion of "//node_state[@uname='arsvr1']/lrm": ok (rc=0)
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: populate_cib_nodes_ha: Requesting the list of configured nodes
Feb 10 16:38:17 arsvr1 crmd: [5251]: WARN: get_uuid: Could not calculate UUID for arsvr2
Feb 10 16:38:17 arsvr1 crmd: [5251]: WARN: populate_cib_nodes_ha: Node arsvr2: no uuid found
Feb 10 16:38:17 arsvr1 attrd: [5119]: info: attrd_local_callback: Sending full refresh (origin=crmd)
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
Feb 10 16:38:17 arsvr1 attrd: [5119]: info: attrd_trigger_update: Sending flush op to all hosts for: shutdown (<null>)
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_dc_join_final: Ensuring DC, quorum and node attributes are up-to-date
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: crm_update_quorum: Updating quorum status to true (call=21)
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: abort_transition_graph: do_te_invoke:191 - Triggered transition abort (complete=1) : Peer Cancelled
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_pe_invoke: Query 22: Requesting the current CIB: S_POLICY_ENGINE
Feb 10 16:38:17 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/19, version=0.298.5): ok (rc=0)
Feb 10 16:38:17 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/21, version=0.298.5): ok (rc=0)
Feb 10 16:38:17 arsvr1 attrd: [5119]: info: attrd_trigger_update: Sending flush op to all hosts for: terminate (<null>)
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_pe_invoke_callback: Invoking the PE: query=22, ref=pe_calc-dc-1297373897-7, seq=1, quorate=1
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: determine_online_status: Node arsvr1 is online
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: group_print:  Resource Group: MySQLDB
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: native_print:      fs_mysql#011(ocf::heartbeat:Filesystem):#011Stopped 
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: native_print:      mysql#011(ocf::heartbeat:mysql):#011Stopped 
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: clone_print:  Master/Slave Set: ms_drbd_mysql
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: short_print:      Stopped: [ drbd_mysql:0 drbd_mysql:1 ]
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: clone_print:  Master/Slave Set: ms_drbd_webfs
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: short_print:      Stopped: [ drbd_webfs:0 drbd_webfs:1 ]
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: group_print:  Resource Group: WebServices
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: native_print:      ip1#011(ocf::heartbeat:IPaddr2):#011Stopped 
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: native_print:      ip1arp#011(ocf::heartbeat:SendArp):#011Stopped 
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: native_print:      fs_webfs#011(ocf::heartbeat:Filesystem):#011Stopped 
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: native_print:      apache2#011(lsb:apache2):#011Stopped 
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ms_drbd_mysql: Rolling back scores from fs_mysql
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ms_drbd_mysql: Rolling back scores from fs_mysql
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_color: Resource drbd_mysql:1 cannot run anywhere
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ms_drbd_mysql: Rolling back scores from fs_mysql
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ms_drbd_mysql: Rolling back scores from fs_mysql
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: master_color: ms_drbd_mysql: Promoted 0 instances of a possible 1 to master
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ip1arp: Rolling back scores from fs_webfs
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ip1arp: Rolling back scores from ip1
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ip1arp: Rolling back scores from fs_mysql
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ip1arp: Rolling back scores from ip1
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ip1arp: Rolling back scores from ip1
Feb 10 16:38:17 arsvr1 crmd: [5251]: WARN: Managed pengine process 5253 killed by signal 11 [SIGSEGV - Segmentation violation].
Feb 10 16:38:17 arsvr1 crmd: [5251]: ERROR: Managed pengine process 5253 dumped core
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: crmdManagedChildDied: Process pengine:[5253] exited (signal=11, exitcode=0)
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: pe_msg_dispatch: Received HUP from pengine:[5253]
Feb 10 16:38:17 arsvr1 crmd: [5251]: CRIT: pe_connection_destroy: Connection to the Policy Engine failed (pid=5253, uuid=679c316a-ec3c-4344-8b45-47d3e6e73fb0)
Feb 10 16:38:17 arsvr1 attrd: [5119]: info: attrd_ha_callback: flush message from arsvr1
Feb 10 16:38:17 arsvr1 attrd: [5119]: info: attrd_ha_callback: flush message from arsvr1
Feb 10 16:38:17 arsvr1 crmd: [5251]: notice: save_cib_contents: Saved CIB contents after PE crash to /var/lib/pengine/pe-core-679c316a-ec3c-4344-8b45-47d3e6e73fb0.bz2
Feb 10 16:38:17 arsvr1 crmd: [5251]: ERROR: do_log: FSA: Input I_ERROR from save_cib_contents() received in state S_POLICY_ENGINE
Feb 10 16:38:17 arsvr1 ccm: [5115]: info: client (pid=5251) removed from ccm
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=save_cib_contents ]
Feb 10 16:38:17 arsvr1 crmd: [5251]: ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
Feb 10 16:38:17 arsvr1 crmd: [5251]: WARN: do_election_vote: Not voting in election, we're in state S_RECOVERY
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_dc_release: DC role released
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_te_control: Transitioner is now inactive
Feb 10 16:38:17 arsvr1 cib: [5116]: info: cib_process_readwrite: We are now in R/O mode
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_te_control: Disconnecting STONITH...
Feb 10 16:38:17 arsvr1 heartbeat: [5014]: WARN: Managed /usr/lib/heartbeat/crmd process 5251 exited with return code 2.
Feb 10 16:38:17 arsvr1 cib: [5116]: WARN: send_ipc_message: IPC Channel to 5251 is not connected
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: tengine_stonith_connection_destroy: Fencing daemon disconnected
Feb 10 16:38:17 arsvr1 heartbeat: [5014]: ERROR: Respawning client "/usr/lib/heartbeat/crmd":
Feb 10 16:38:17 arsvr1 cib: [5116]: WARN: send_via_callback_channel: Delivery of reply to client 5251/dffbb159-0075-4af5-9767-eda4efff2658 failed
Feb 10 16:38:17 arsvr1 crmd: [5251]: notice: Not currently connected.
Feb 10 16:38:17 arsvr1 heartbeat: [5014]: info: Starting child client "/usr/lib/heartbeat/crmd" (107,117)
Feb 10 16:38:17 arsvr1 cib: [5116]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
Feb 10 16:38:17 arsvr1 crmd: [5251]: ERROR: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Feb 10 16:38:17 arsvr1 heartbeat: [5254]: info: Starting "/usr/lib/heartbeat/crmd" as uid 107  gid 117 (pid 5254)
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_state_transition: State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_shutdown: All subsystems stopped, continuing
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_lrm_control: Disconnected from the LRM
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_ha_control: Disconnected from Heartbeat
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_cib_control: Disconnecting CIB
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: crmd_cib_connection_destroy: Connection to the CIB terminated...
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
Feb 10 16:38:17 arsvr1 crmd: [5251]: ERROR: do_exit: Could not recover from internal error
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: free_mem: Dropping I_PENDING: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_election_vote ]
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: free_mem: Dropping I_RELEASE_SUCCESS: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_dc_release ]
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_exit: [crmd] stopped (2)
Feb 10 16:38:17 arsvr1 crmd: [5254]: info: Invoked: /usr/lib/heartbeat/crmd 
Feb 10 16:38:17 arsvr1 crmd: [5254]: info: main: CRM Hg Version: 042548a451fce8400660f6031f4da6f0223dd5dd
Feb 10 16:38:17 arsvr1 crmd: [5254]: info: crmd_init: Starting crmd
Feb 10 16:38:17 arsvr1 crmd: [5254]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Feb 10 16:38:17 arsvr1 crmd: [5254]: info: do_cib_control: CIB connection established
Feb 10 16:38:17 arsvr1 crmd: [5254]: info: crm_cluster_connect: Connecting to Heartbeat
Feb 10 16:38:18 arsvr1 crmd: [5254]: info: register_heartbeat_conn: Hostname: arsvr1
Feb 10 16:38:18 arsvr1 crmd: [5254]: info: register_heartbeat_conn: UUID: bf0e7394-9684-42b9-893b-5a9a6ecddd7e
Feb 10 16:38:18 arsvr1 crmd: [5254]: info: do_ha_control: Connected to the cluster
Feb 10 16:38:18 arsvr1 crmd: [5254]: info: do_ccm_control: CCM connection established... waiting for first callback
Feb 10 16:38:18 arsvr1 crmd: [5254]: info: do_started: Delaying start, CCM (0000000000100000) not connected
Feb 10 16:38:18 arsvr1 crmd: [5254]: info: crmd_init: Starting crmd's mainloop
Feb 10 16:38:18 arsvr1 crmd: [5254]: info: config_query_callback: Checking for expired actions every 900000ms
Feb 10 16:38:18 arsvr1 crmd: [5254]: notice: crmd_client_status_callback: Status update: Client arsvr1/crmd now has status [online] (DC=false)
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: crm_new_peer: Node 0 is now known as arsvr1
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: crm_update_peer_proc: arsvr1.crmd is now online
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: crmd_client_status_callback: Not the DC
Feb 10 16:38:19 arsvr1 crmd: [5254]: notice: crmd_client_status_callback: Status update: Client arsvr1/crmd now has status [online] (DC=false)
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: crmd_client_status_callback: Not the DC
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: mem_handle_event: instance=1, nodes=1, new=1, lost=0, n_idx=0, new_idx=0, old_idx=3
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=1)
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: ccm_event_detail: NEW MEMBERSHIP: trans=1, nodes=1, new=1, lost=0 n_idx=0, new_idx=0, old_idx=3
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: ccm_event_detail: #011CURRENT: arsvr1 [nodeid=0, born=1]
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: ccm_event_detail: #011NEW:     arsvr1 [nodeid=0, born=1]
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: crm_update_peer: Node arsvr1: id=0 state=member (new) addr=(null) votes=-1 born=1 seen=1 proc=00000000000000000000000000000200
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: crm_update_peer_proc: arsvr1.ais is now online
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: do_started: The local CRM is operational
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
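(In case it helps, here is a sketch of how the crash could be inspected; the pe-core file name is the one from the log above, but the ptest/gdb steps and the core file location are assumptions I have not verified on this box:)

    # Replay the CIB that was saved when pengine crashed:
    ptest -VVV -x /var/lib/pengine/pe-core-679c316a-ec3c-4344-8b45-47d3e6e73fb0.bz2

    # If a core file was written (its location depends on the kernel's core_pattern),
    # a backtrace should show where pengine segfaulted:
    gdb /usr/lib/heartbeat/pengine /path/to/core.5253
    (gdb) bt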

Liang Ma
Contractuel | Consultant | SED Systems Inc. 
Ground Systems Analyst
Agence spatiale canadienne | Canadian Space Agency
6767, Route de l'Aéroport, Longueuil (St-Hubert), QC, Canada, J3Y 8Y9
Tél/Tel : (450) 926-5099 | Téléc/Fax: (450) 926-5083
Courriel/E-mail : [liang.ma at space.gc.ca]
Site web/Web site : [www.space.gc.ca]



-----Original Message-----
From: Ma, Liang 
Sent: February 10, 2011 9:08 AM
To: The Pacemaker cluster resource manager
Subject: RE: [Pacemaker] Could not connect to the CIB: Remote node did not respond

Thanks Andrew.

Yes, cibadmin -Ql works, but cibadmin -Q does not.
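(If I understand correctly, the difference is the -l/--local flag: cibadmin -Ql is answered by the local cib process, while plain cibadmin -Q is routed through the DC, which would explain the timeout when no DC exists. For comparison:)

    cibadmin -Q     # cluster-wide query, forwarded to the DC
    cibadmin -Ql    # -l / --local: answered by the local cib daemon only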

What is the DC?

And here are the logs.

Feb 10 08:57:30 arsvr1 cibadmin: [4264]: info: Invoked: cibadmin -Ql 
Feb 10 08:57:32 arsvr1 cibadmin: [4265]: info: Invoked: cibadmin -Q 
Feb 10 08:58:04 arsvr1 crmd: [960]: info: do_state_transition: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ] 
Feb 10 08:58:04 arsvr1 crmd: [960]: info: do_dc_release: DC role released 
Feb 10 08:58:04 arsvr1 crmd: [960]: info: do_te_control: Transitioner is now inactive 
Feb 10 08:58:08 arsvr1 crmd: [960]: info: update_dc: Set DC to arsvr2 (3.0.1) 
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_local_callback: Sending full refresh (origin=crmd)
Feb 10 08:58:10 arsvr1 crmd: [960]: info: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_trigger_update: Sending flush op to all hosts for: shutdown (<null>)
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_mysql:0 (<null>)
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_trigger_update: Sending flush op to all hosts for: terminate (<null>)
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_webfs:0 (<null>)
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (<null>)
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_ha_callback: flush message from arsvr1 
Feb 10 08:58:12 arsvr1 attrd: last message repeated 4 times 
Feb 10 08:58:12 arsvr1 attrd: [959]: info: attrd_ha_callback: flush message from arsvr2 
Feb 10 08:58:12 arsvr1 attrd: [959]: info: attrd_ha_callback: flush message from arsvr2
Feb 10 08:58:12 arsvr1 crmd: [960]: notice: crmd_client_status_callback: Status update: Client arsvr2/crmd now has status [offline] (DC=false)
Feb 10 08:58:12 arsvr1 attrd: [959]: info: attrd_ha_callback: flush message from arsvr2
Feb 10 08:58:12 arsvr1 crmd: [960]: info: crm_update_peer_proc: arsvr2.crmd is now offline
Feb 10 08:58:12 arsvr1 attrd: [959]: info: attrd_ha_callback: flush message from arsvr2
Feb 10 08:58:12 arsvr1 crmd: [960]: info: crmd_client_status_callback: Got client status callback - our DC is dead
Feb 10 08:58:12 arsvr1 crmd: [960]: notice: crmd_client_status_callback: Status update: Client arsvr2/crmd now has status [online] (DC=false)
Feb 10 08:58:12 arsvr1 crmd: [960]: info: crm_update_peer_proc: arsvr2.crmd is now online
Feb 10 08:58:12 arsvr1 crmd: [960]: info: crmd_client_status_callback: Not the DC
Feb 10 08:58:12 arsvr1 crmd: [960]: info: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=crmd_client_status_callback ] 
Feb 10 08:58:12 arsvr1 crmd: [960]: info: update_dc: Unset DC arsvr2 
Feb 10 08:58:12 arsvr1 attrd: [959]: info: attrd_ha_callback: flush message from arsvr2 
Feb 10 08:58:14 arsvr1 heartbeat: [898]: WARN: 1 lost packet(s) for [arsvr2] [131787:131789] 
Feb 10 08:58:14 arsvr1 heartbeat: [898]: info: No pkts missing from arsvr2!
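(To check at any moment whether a DC has been elected, a one-shot status query like this should work, assuming crm_mon is in the PATH:)

    crm_mon -1 | grep 'Current DC'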


Liang Ma
Contractuel | Consultant | SED Systems Inc. 
Ground Systems Analyst
Agence spatiale canadienne | Canadian Space Agency
6767, Route de l'Aéroport, Longueuil (St-Hubert), QC, Canada, J3Y 8Y9
Tél/Tel : (450) 926-5099 | Téléc/Fax: (450) 926-5083
Courriel/E-mail : [liang.ma at space.gc.ca]
Site web/Web site : [www.space.gc.ca]




-----Original Message-----
From: Andrew Beekhof [mailto:andrew at beekhof.net] 
Sent: February 10, 2011 2:39 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Could not connect to the CIB: Remote node did not respond

On Wed, Feb 9, 2011 at 3:59 PM,  <Liang.Ma at asc-csa.gc.ca> wrote:
> Hi There,
>
> After a network and power shutdown, my LAMP cluster servers were totally screwed up.
>
> Now crm status gives me
>
> crm status
> ============
> Last updated: Wed Feb  9 09:44:17 2011
> Stack: Heartbeat
> Current DC: arsvr2 (bc6bf61d-6b5f-4307-85f3-bf7bb11531bb) - partition with quorum
> Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
> 2 Nodes configured, 1 expected votes
> 4 Resources configured.
> ============
>
> Online: [ arsvr1 arsvr2 ]
>
> None of the resources comes up.
>
> First I found a split-brain on the drbd disks. I fixed that and the drbd disks are healthy. I can mount them manually without problem.
>
> However, if I try anything to bring up a resource, edit the cib, or even run a query, it gives me errors like the following
>
> crm resource start fs_mysql
> Call cib_replace failed (-41): Remote node did not respond <null>
>
> crm configure edit
> Could not connect to the CIB: Remote node did not respond
> ERROR: creating tmp shadow __crmshell.2540 failed
>
>
> cibadmin -Q
> Call cib_query failed (-41): Remote node did not respond <null>
>
> Any idea what I can do to bring the cluster back?

Seems like you don't have a DC.
Hard to say why without logs.

Does cibadmin -Ql work?

_______________________________________________
Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker



