[Pacemaker] issues when installing on pxe booted environment
Andreas Kurz
andreas at hastexo.com
Mon Mar 25 19:21:54 EDT 2013

On 2013-03-22 19:31, John White wrote:
> Hello Folks,
> 	We're trying to get a corosync/pacemaker instance going on a 4 node cluster that boots via pxe.  There have been a number of state/file system issues, but those appear to be *mostly* taken care of thus far.  We're running into an issue now where cib just isn't staying up with errors akin to the following (sorry for the lengthy dump, note the attrd and cib connection errors).  Any ideas would be greatly appreciated: 
> 
> Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating RNG parser context
> Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked: /usr/lib64/heartbeat/attrd 
> Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up
> Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster type is: 'corosync'
> Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could not connect to the Cluster Process Group API: 2
> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed
> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection active
> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute updates
> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup
> Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked: /usr/lib64/heartbeat/pengine 
> Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old instances of pengine
> Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pengine
Is the "/var/run/crm" directory available, owned by hacluster.haclient,
and writable by at least the hacluster user?
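
One way to answer that on each node is a small check along these lines. This is only a sketch: the helper names (owner_name, group_name, check_runtime_dir) are made up for illustration and are not part of Pacemaker, and the defaults simply mirror the hacluster/haclient names and the /var/run/crm path mentioned above. On a PXE-booted image /var/run may be recreated empty at every boot, so it could be worth running after each boot rather than once.

```python
import grp
import os
import pwd
import stat


def owner_name(uid):
    """Resolve a uid to a user name, falling back to the number as text."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return str(uid)


def group_name(gid):
    """Resolve a gid to a group name, falling back to the number as text."""
    try:
        return grp.getgrgid(gid).gr_name
    except KeyError:
        return str(gid)


def check_runtime_dir(path, user="hacluster", group="haclient"):
    """Return a list of problems found with `path`; empty means it looks fine.

    Hypothetical helper: the default user and group are assumptions taken
    from the names used in this thread.
    """
    try:
        st = os.stat(path)
    except OSError as exc:
        return [f"cannot stat {path}: {exc}"]
    if not stat.S_ISDIR(st.st_mode):
        return [f"{path} is not a directory"]

    problems = []
    owner = owner_name(st.st_uid)
    if owner != user:
        problems.append(f"{path} is owned by {owner}, expected {user}")
    grp_name = group_name(st.st_gid)
    if grp_name != group:
        problems.append(f"{path} has group {grp_name}, expected {group}")
    # Only the owner write bit is checked, since the owner is expected
    # to be the cluster user.
    if not st.st_mode & stat.S_IWUSR:
        problems.append(f"{path} is not writable by its owner")
    return problems
```

Calling check_runtime_dir("/var/run/crm") and printing whatever it returns shows at a glance whether the directory is missing or has the wrong ownership or mode.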
Regards,
Andreas
-- 
Need help with Pacemaker?
http://www.hastexo.com/now
> Mar 22 11:25:18 n0014 pacemakerd: [25834]: ERROR: pcmk_child_exit: Child process attrd exited (pid=25841, rc=100)
> Mar 22 11:25:18 n0014 pacemakerd: [25834]: notice: pcmk_child_exit: Child process attrd no longer wishes to be respawned
> Mar 22 11:25:18 n0014 pacemakerd: [25834]: info: update_node_processes: Node n0014.lustre now has process list: 00000000000000000000000000110312 (was 00000000000000000000000000111312)
> Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/pengine
> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Init server comms
> Mar 22 11:25:18 n0014 pengine: [25842]: info: main: Starting pengine
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_cpg_connection: Adding fd=4 to mainloop
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: init_ais_connection_once: Connection to 'corosync': established
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: crm_new_peer: Creating entry for node n0014.lustre/247988234
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node n0014.lustre now has id: 247988234
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node 247988234 is now known as n0014.lustre
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pcmk
> Mar 22 11:25:18 n0014 crmd: [25843]: info: Invoked: /usr/lib64/heartbeat/crmd 
> Mar 22 11:25:18 n0014 pacemakerd: [25834]: debug: pcmk_client_connect: Channel 0x995530 connected: 1 children
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: main: Starting stonith-ng mainloop
> Mar 22 11:25:18 n0014 crmd: [25843]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
> Mar 22 11:25:18 n0014 crmd: [25843]: info: main: CRM Hg Version: a02c0f19a00c1eb2527ad38f146ebc0834814558
> Mar 22 11:25:18 n0014 crmd: [25843]: info: crmd_init: Starting crmd
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: s_crmd_fsa: Processing I_STARTUP: [ state=S_STARTING cause=C_STARTUP origin=crmd_init ]
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_LOG   
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_STARTUP
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Registering Signal Handlers
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Creating CIB and LRM objects
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_update_peer: Node n0014.lustre: id=247988234 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000110312 (new)
> Mar 22 11:25:18 n0014 crmd: [25843]: info: G_main_add_SignalHandler: Added signal handler for signal 17
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_CIB_START
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/cib_rw
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/cib_rw
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection to command channel failed
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/cib_callback
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/cib_callback
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection to callback channel failed
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection to CIB failed: connection failed
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signoff: Signing out of the CIB Service
> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: Element cib failed to validate content
> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: readCibXmlFile: CIB does not validate with <null>
> Mar 22 11:25:18 n0014 cib: [25839]: info: startCib: CIB Initialization completed successfully
> Mar 22 11:25:18 n0014 cib: [25839]: info: get_cluster_type: Cluster type is: 'corosync'
> Mar 22 11:25:18 n0014 cib: [25839]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: init_cpg_connection: Could not connect to the Cluster Process Group API: 2
> Mar 22 11:25:18 n0014 cib: [25839]: CRIT: cib_init: Cannot sign in to the cluster... terminating
> 
> 
> ----------------
> John White
> HPC Systems Engineer
> (510) 486-7307
> One Cyclotron Rd, MS: 50C-3209C
> Lawrence Berkeley National Lab
> Berkeley, CA 94720
> 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 
    
    