[ClusterLabs] [Pacemaker] Pacemaker won't start after node was fenced

Jake Smith jsmith at argotec.com
Tue Mar 3 17:35:37 EST 2015


That will be tough, but I'll see if I can give it a try soon.

I've had no luck tracking down that error, so I'm running out of other options :/

Jake

-----Original Message-----
From: Andrew Beekhof [mailto:andrew at beekhof.net]
Sent: Monday, February 23, 2015 7:43 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Pacemaker won't start after node was fenced


> On 27 Jan 2015, at 5:23 pm, Jake Smith <jsmith at argotec.com> wrote:
>
> I had a failover of my active/passive cluster, and now the passive node will
> not rejoin the cluster.
>
> 2 nodes running Ubuntu 12.04
> corosync 1.4.2-2, openais 1.1.4-4, pacemaker 1.1.6-2ubuntu3
>
> Corosync ring membership is fine on both rings.
>
> Tried stopping corosync/pacemaker, clearing /var/lib/heartbeat/crm/, and then
> restarting on the passive node, without success.
> Tried rebooting the passive node (again; it was successfully fenced).
> Tried updating pacemaker to the latest version in the distro (1.1.6-2ubuntu3.3),
> then went back on the passive node.
> Tried putting the active node in maintenance mode and stopping pacemaker and
> corosync on both nodes, then restarting on both nodes. Corosync came back fine
> as before, but now I have the same problem on both nodes: pacemaker does not
> start successfully. Both now show exactly the same error:
> attrd: [24883]: ERROR: main: HA Signon failed.
>
> Log:
> Jan 27 01:09:59 Condor crmd: [24885]: info: crmd_init: Starting crmd
> Jan 27 01:09:59 Condor cib: [24881]: info: validate_with_relaxng: Creating RNG parser context
> Jan 27 01:09:59 Condor lrmd: [24882]: info: enabling coredumps
> Jan 27 01:09:59 Condor lrmd: [24882]: info: Started.
> Jan 27 01:09:59 Condor corosync[24778]:   [IPC   ] Invalid IPC credentials.

This seems to be the root of the errors.
Pacemaker looks a little old; could you consider updating?
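Since the corosync "Invalid IPC credentials" lines look like the root cause, it can help to check which Pacemaker daemon fails right after each one in a longer syslog excerpt. A minimal Python sketch, assuming the Pacemaker 1.1-era line format used in this log and the heuristic that the next ERROR/CRIT line from a Pacemaker daemon belongs to the daemon whose connection corosync refused. `PCMK_LINE` and `pair_ipc_failures` are hypothetical names, not part of any Pacemaker tooling.

```python
import re

# Pacemaker 1.1-style syslog line, e.g.
# "Jan 27 01:09:59 Condor attrd: [24883]: ERROR: main: HA Signon failed"
# (format assumed from the log in this thread)
PCMK_LINE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d) (?P<host>\S+) "
    r"(?P<daemon>[\w-]+): \[(?P<pid>\d+)\]: (?P<level>\w+): (?P<msg>.*)$"
)


def pair_ipc_failures(lines):
    """Pair each corosync 'Invalid IPC credentials' line with the next
    Pacemaker ERROR/CRIT line.

    Heuristic only: assumes the next error comes from the daemon whose
    IPC connection corosync just rejected.
    """
    pairs = []
    pending = None  # timestamp of an IPC rejection not yet matched
    for line in lines:
        if "Invalid IPC credentials" in line:
            pending = line[:15]  # "Jan 27 01:09:59" is 15 characters
            continue
        m = PCMK_LINE.match(line)
        if pending and m and m.group("level") in ("ERROR", "CRIT"):
            pairs.append((pending, m.group("daemon"), m.group("msg")))
            pending = None
    return pairs
```

Run over the excerpt in this thread, it pairs the first rejection with attrd's "HA Signon failed" and the second with cib's "Cannot sign in to the cluster", which matches the order in which pacemakerd reports the two children exiting.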

> Jan 27 01:09:59 Condor attrd: [24883]: ERROR: main: HA Signon failed
> Jan 27 01:09:59 Condor attrd: [24883]: ERROR: main: Aborting startup
> Jan 27 01:09:59 Condor pacemakerd: [24877]: ERROR: pcmk_child_exit: Child process attrd exited (pid=24883, rc=100)
> Jan 27 01:09:59 Condor pacemakerd: [24877]: notice: pcmk_child_exit: Child process attrd no longer wishes to be respawned
> Jan 27 01:09:59 Condor pacemakerd: [24877]: info: update_node_processes: Node Condor now has process list: 00000000000000000000000000110312 (was 00000000000000000000000000111312)
> Jan 27 01:09:59 Condor stonith-ng: [24880]: info: init_ais_connection_classic: AIS connection established
> Jan 27 01:09:59 Condor stonith-ng: [24880]: info: get_ais_nodeid: Server details: id=167837962 uname=Condor cname=pcmk
> Jan 27 01:09:59 Condor stonith-ng: [24880]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
> Jan 27 01:09:59 Condor stonith-ng: [24880]: info: crm_new_peer: Node Condor now has id: 167837962
> Jan 27 01:09:59 Condor stonith-ng: [24880]: info: crm_new_peer: Node 167837962 is now known as Condor
> Jan 27 01:09:59 Condor stonith-ng: [24880]: info: main: Starting stonith-ng mainloop
> Jan 27 01:09:59 Condor stonith-ng: [24880]: info: crm_update_peer: Node Condor: id=167837962 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000110312 (new)
> Jan 27 01:09:59 Condor cib: [24881]: info: startCib: CIB Initialization completed successfully
> Jan 27 01:09:59 Condor cib: [24881]: info: get_cluster_type: Cluster type is: 'openais'
> Jan 27 01:09:59 Condor cib: [24881]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
> Jan 27 01:09:59 Condor cib: [24881]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
> Jan 27 01:09:59 Condor corosync[24778]:   [IPC   ] Invalid IPC credentials.
> Jan 27 01:09:59 Condor cib: [24881]: info: init_ais_connection_classic: Connection to our AIS plugin (9) failed: unknown (100)
> Jan 27 01:09:59 Condor cib: [24881]: CRIT: cib_init: Cannot sign in to the cluster... terminating
> Jan 27 01:09:59 Condor pacemakerd: [24877]: ERROR: pcmk_child_exit: Child process cib exited (pid=24881, rc=100)
> Jan 27 01:09:59 Condor pacemakerd: [24877]: notice: pcmk_child_exit: Child process cib no longer wishes to be respawned
> Jan 27 01:09:59 Condor pacemakerd: [24877]: info: update_node_processes: Node Condor now has process list: 00000000000000000000000000110212 (was 00000000000000000000000000110312)
> Jan 27 01:09:59 Condor stonith-ng: [24880]: info: crm_update_peer: Node Condor: id=167837962 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000110212 (new)
> Jan 27 01:10:00 Condor crmd: [24885]: info: do_cib_control: Could not connect to the CIB service: connection failed
> Jan 27 01:10:00 Condor crmd: [24885]: WARN: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
> Jan 27 01:10:00 Condor crmd: [24885]: info: crmd_init: Starting crmd's mainloop
> Jan 27 01:10:01 Condor CRON[24888]: (root) CMD (/etc/init.d/watchdog -e >/dev/null 2>&1)
> Jan 27 01:10:02 Condor crmd: [24885]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
> Jan 27 01:10:03 Condor crmd: [24885]: info: do_cib_control: Could not connect to the CIB service: connection failed
> Jan 27 01:10:03 Condor crmd: [24885]: WARN: do_cib_control: Couldn't complete CIB registration 2 times... pause and retry
> Jan 27 01:10:05 Condor crmd: [24885]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
> Jan 27 01:10:06 Condor crmd: [24885]: info: do_cib_control: Could not connect to the CIB service: connection failed
> Jan 27 01:10:06 Condor crmd: [24885]: WARN: do_cib_control: Couldn't complete CIB registration 3 times... pause and retry
> Jan 27 01:10:08 Condor crmd: [24885]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
> Jan 27 01:10:09 Condor crmd: [24885]: info: do_cib_control: Could not connect to the CIB service: connection failed
> Jan 27 01:10:09 Condor crmd: [24885]: WARN: do_cib_control: Couldn't complete CIB registration 4 times... pause and retry
> Jan 27 01:10:11 Condor crmd: [24885]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
> Jan 27 01:10:12 Condor crmd: [24885]: info: do_cib_control: Could not connect to the CIB service: connection failed
> Jan 27 01:10:12 Condor crmd: [24885]: WARN: do_cib_control: Couldn't complete CIB registration 5 times... pause and retry
>
> Jacob A. Smith
> IT Manager
> Argotec, LLC
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
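The "process list" values reported by pacemakerd and stonith-ng look like zero-padded hexadecimal bitmasks of the running daemons. Under that assumption, a small Python sketch (`cleared_bits` is a hypothetical helper) shows which bit drops out at each step of the log: 0x1000 right after attrd exits and 0x100 right after cib exits. Those bit-to-daemon correspondences are inferred from the order of events in this log, not taken from Pacemaker documentation.

```python
def cleared_bits(before: str, after: str) -> list[str]:
    """Return the bits set in `before` but not in `after`, as hex strings.

    Assumes the process-list field is a hexadecimal bitmask, as the
    32-character zero-padded values in the log suggest.
    """
    old, new = int(before, 16), int(after, 16)
    gone = old & ~new  # bits that were set and are now clear
    return [hex(1 << i) for i in range(gone.bit_length()) if (gone >> i) & 1]


# Transition logged after attrd exited (was -> now)
print(cleared_bits("00000000000000000000000000111312",
                   "00000000000000000000000000110312"))
# Transition logged after cib exited (was -> now)
print(cleared_bits("00000000000000000000000000110312",
                   "00000000000000000000000000110212"))
```

Diffing the masks this way is just a quick check that each "update_node_processes" line removed exactly one daemon and nothing else.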

