[Pacemaker] Could not connect to the CIB: Remote node did not respond

Thu Feb 10 09:07:57 EST 2011

Thanks Andrew.

Yes, cibadmin -Ql works, but cibadmin -Q does not.

What is a DC?

And here are the logs.

Feb 10 08:57:30 arsvr1 cibadmin: [4264]: info: Invoked: cibadmin -Ql 
Feb 10 08:57:32 arsvr1 cibadmin: [4265]: info: Invoked: cibadmin -Q 
Feb 10 08:58:04 arsvr1 crmd: [960]: info: do_state_transition: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ] 
Feb 10 08:58:04 arsvr1 crmd: [960]: info: do_dc_release: DC role released 
Feb 10 08:58:04 arsvr1 crmd: [960]: info: do_te_control: Transitioner is now inactive 
Feb 10 08:58:08 arsvr1 crmd: [960]: info: update_dc: Set DC to arsvr2 (3.0.1) 
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_local_callback: Sending full refresh (origin=crmd)
Feb 10 08:58:10 arsvr1 crmd: [960]: info: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ] 
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_trigger_update: Sending flush op to all hosts for: shutdown (<null>) 
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_mysql:0 (<null>) 
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_trigger_update: Sending flush op to all hosts for: terminate (<null>) 
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_webfs:0 (<null>) 
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (<null>) 
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_ha_callback: flush message from arsvr1 
Feb 10 08:58:12 arsvr1 attrd: last message repeated 4 times 
Feb 10 08:58:12 arsvr1 attrd: [959]: info: attrd_ha_callback: flush message from arsvr2 
Feb 10 08:58:12 arsvr1 attrd: [959]: info: attrd_ha_callback: flush message from arsvr2 
Feb 10 08:58:12 arsvr1 crmd: [960]: notice: crmd_client_status_callback: Status update: Client arsvr2/crmd now has status [offline] (DC=false) 
Feb 10 08:58:12 arsvr1 attrd: [959]: info: attrd_ha_callback: flush message from arsvr2 
Feb 10 08:58:12 arsvr1 crmd: [960]: info: crm_update_peer_proc: arsvr2.crmd is now offline
Feb 10 08:58:12 arsvr1 attrd: [959]: info: attrd_ha_callback: flush message from arsvr2 
Feb 10 08:58:12 arsvr1 crmd: [960]: info: crmd_client_status_callback: Got client status callback - our DC is dead 
Feb 10 08:58:12 arsvr1 crmd: [960]: notice: crmd_client_status_callback: Status update: Client arsvr2/crmd now has status [online] (DC=false) 
Feb 10 08:58:12 arsvr1 crmd: [960]: info: crm_update_peer_proc: arsvr2.crmd is now online
Feb 10 08:58:12 arsvr1 crmd: [960]: info: crmd_client_status_callback: Not the DC
Feb 10 08:58:12 arsvr1 crmd: [960]: info: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=crmd_client_status_callback ] 
Feb 10 08:58:12 arsvr1 crmd: [960]: info: update_dc: Unset DC arsvr2 
Feb 10 08:58:12 arsvr1 attrd: [959]: info: attrd_ha_callback: flush message from arsvr2 
Feb 10 08:58:14 arsvr1 heartbeat: [898]: WARN: 1 lost packet(s) for [arsvr2] [131787:131789] 
Feb 10 08:58:14 arsvr1 heartbeat: [898]: info: No pkts missing from arsvr2!
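In the crmd lines above, arsvr1 accepts arsvr2 as DC at 08:58:08, sees arsvr2's crmd go offline and come back at 08:58:12, then unsets the DC and falls back to S_ELECTION. The minimal Python sketch below pulls those DC changes and state transitions out of a log so the sequence is easier to follow. The regular expressions and the `dc_events` helper are hypothetical, written only against the syslog format shown in this message (Pacemaker 1.0 on the Heartbeat stack); other versions may word these messages differently.

```python
import re

# Hypothetical patterns written against the crmd syslog lines quoted above
# (Pacemaker 1.0, Heartbeat stack). Other versions may phrase these differently.
TIME = r"^(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d) "
SET_DC = re.compile(TIME + r".*update_dc: Set DC to (?P<node>\S+)")
UNSET_DC = re.compile(TIME + r".*update_dc: Unset DC (?P<node>\S+)")
TRANSITION = re.compile(TIME + r".*State transition (?P<src>\S+) -> (?P<dst>\S+)")


def dc_events(lines):
    """Return (timestamp, description) tuples for DC changes and crmd state transitions.

    Lines matching none of the patterns (e.g. attrd or heartbeat messages) are skipped.
    """
    events = []
    for line in lines:
        if m := SET_DC.match(line):
            events.append((m["ts"], f"DC set to {m['node']}"))
        elif m := UNSET_DC.match(line):
            events.append((m["ts"], f"DC {m['node']} unset"))
        elif m := TRANSITION.match(line):
            events.append((m["ts"], f"{m['src']} -> {m['dst']}"))
    return events


# A few lines copied from the log in this message.
SAMPLE = [
    "Feb 10 08:58:04 arsvr1 crmd: [960]: info: do_state_transition: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]",
    "Feb 10 08:58:04 arsvr1 crmd: [960]: info: do_dc_release: DC role released",
    "Feb 10 08:58:08 arsvr1 crmd: [960]: info: update_dc: Set DC to arsvr2 (3.0.1)",
    "Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_local_callback: Sending full refresh (origin=crmd)",
    "Feb 10 08:58:12 arsvr1 crmd: [960]: info: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=crmd_client_status_callback ]",
    "Feb 10 08:58:12 arsvr1 crmd: [960]: info: update_dc: Unset DC arsvr2",
]

if __name__ == "__main__":
    for ts, what in dc_events(SAMPLE):
        print(ts, what)
    # Feb 10 08:58:04 S_ELECTION -> S_PENDING
    # Feb 10 08:58:08 DC set to arsvr2
    # Feb 10 08:58:12 S_NOT_DC -> S_ELECTION
    # Feb 10 08:58:12 DC arsvr2 unset
```

Run over a longer stretch of syslog, a repeating "DC set" / "DC unset" pattern would suggest the DC is not staying stable, which would be consistent with cibadmin -Q failing while the local query cibadmin -Ql works.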

Liang Ma
Contractuel | Consultant | SED Systems Inc. 
Ground Systems Analyst
Agence spatiale canadienne | Canadian Space Agency
6767, Route de l'Aéroport, Longueuil (St-Hubert), QC, Canada, J3Y 8Y9
Tél/Tel : (450) 926-5099 | Téléc/Fax: (450) 926-5083
Courriel/E-mail : [liang.ma at space.gc.ca]
Site web/Web site : [www.space.gc.ca]

-----Original Message-----
From: Andrew Beekhof [mailto:andrew at beekhof.net] 
Sent: February 10, 2011 2:39 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Could not connect to the CIB: Remote node did not respond

On Wed, Feb 9, 2011 at 3:59 PM,  <Liang.Ma at asc-csa.gc.ca> wrote:
> Hi There,
>
> After a network and power shutdown, my LAMP cluster servers were totally screwed up.
>
> Now crm status gives me
>
> crm status
> ============
> Last updated: Wed Feb  9 09:44:17 2011
> Stack: Heartbeat
> Current DC: arsvr2 (bc6bf61d-6b5f-4307-85f3-bf7bb11531bb) - partition with quorum
> Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
> 2 Nodes configured, 1 expected votes
> 4 Resources configured.
> ============
>
> Online: [ arsvr1 arsvr2 ]
>
> None of the resources come up.
>
> First I found a split brain on the DRBD disks. I fixed that, and the DRBD disks are now healthy. I can mount them manually without problem.
>
> However, whatever I try, whether starting a resource, editing the CIB, or even running a query, I get errors like the following:
>
> crm resource start fs_mysql
> Call cib_replace failed (-41): Remote node did not respond <null>
>
> crm configure edit
> Could not connect to the CIB: Remote node did not respond
> ERROR: creating tmp shadow __crmshell.2540 failed
>
>
> cibadmin -Q
> Call cib_query failed (-41): Remote node did not respond <null>
>
> Any idea what I can do to bring the cluster back?

Seems like you don't have a DC.
Hard to say why without logs.

Does cibadmin -Ql work?

_______________________________________________
Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker