[Pacemaker] node can't join cluster after reboot

Vladimir Elisseev vovan at vovan.nl
Tue Oct 30 05:35:59 UTC 2012


Thanks for trying to help! Currently I can't provide a crm_report from the
failed node, as I've decided to restore the complete node from backup.
The versions I use are corosync-1.3.0 and pacemaker-1.0.10. The problem
actually occurred after updating quite a few system packages, but all the
cluster-related software was untouched. I've found exactly the same
issue described on the mailing list earlier:
http://www.gossamer-threads.com/lists/linuxha/pacemaker/77881?do=post_view_threaded#77881
At least the symptoms are exactly the same, as are the pasted log files.
I've also tried enabling debug logging and saw that crm tries to connect to
the cib sockets (/var/run/crm_*) too early (IMO) and fails because cib
hasn't started yet.
I'm planning to repeat the update of this system, but I'll do it
more carefully in order to understand which particular package leads to
this behavior. BTW, how can I create a crm_report? I can't find this
binary anywhere on the system. Let me know what kind of input you'll
need if I'm able to reproduce this problem.
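
While repeating the update, it may help to measure how often each child
process dies rather than eyeballing the log. Below is a minimal Python
sketch, not part of Pacemaker or corosync, that counts the
"pcmk_wait_dispatch: Child process ... terminated with signal ..." lines
quoted further down. The function name count_child_crashes and the regular
expression are my own assumptions, derived only from the log format in that
excerpt; other versions may log differently.

```python
import re
from collections import Counter

# Assumption: termination lines look like the one in the quoted log, e.g.
# "... pcmk_wait_dispatch: Child process cib terminated with signal 6 (pid=10646, core=true)"
TERMINATED = re.compile(
    r"pcmk_wait_dispatch: Child process (?P<child>\S+) "
    r"terminated with signal (?P<signal>\d+) \(pid=(?P<pid>\d+)"
)


def count_child_crashes(lines):
    """Count termination lines, keyed by (child name, signal number)."""
    crashes = Counter()
    for line in lines:
        match = TERMINATED.search(line)
        if match:
            key = (match.group("child"), int(match.group("signal")))
            crashes[key] += 1
    return crashes
```

Passing an open log file (any iterable of lines works) returns a Counter,
so a per-package comparison is just a matter of running it on the log after
each update step and checking which step makes the counts rise.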

Regards,
Vlad.


On Tue, 2012-10-30 at 16:00 +1100, Andrew Beekhof wrote:
> On Sun, Oct 28, 2012 at 9:05 PM, Vladimir Elisseev <vovan at vovan.nl> wrote:
> > Hello,
> >
> > I'm having a problem: after a reboot, one cluster node can't join the
> > cluster anymore. From the log file I can't understand what is actually
> > going on. I can only see that cib and crm are both respawned frequently.
> > I'd appreciate any help. Below is the relevant part of the log file:
> 
> I appreciate that you're trying to keep it brief, but problems often
> originate much earlier than people suspect.
> Can you instead attach a crm_report tarball? That will have everything
> (from both nodes) that we need to be able to help.
> 
> What version is this btw?
> 
> >
> > Oct 28 10:52:22 srv2 cib: [10646]: info: cib_server_process_diff: Requesting re-sync from peer
> > Oct 28 10:52:22 srv2 cib: [10646]: WARN: cib_diff_notify: Local-only Change (client:crmd, call: 4770): -1.-1.-1 (Application of an update diff failed, requesting a full refresh)
> > Oct 28 10:52:22 srv2 cib: [10653]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.qJTUAV (digest: /var/lib/heartbeat/crm/cib.XwOKXQ)
> > Oct 28 10:52:22 srv2 cib: [10646]: WARN: cib_server_process_diff: Not applying diff 0.1298.5 -> 0.1299.1 (sync in progress)
> > Oct 28 10:52:22 srv2 cib: [10646]: info: cib_replace_notify: Local-only Replace: -1.-1.-1 from srv1
> > Oct 28 10:52:22 corosync [pcmk]:  ] info: pcmk_ipc_exit: Client cib (conn=0x1837340, async-conn=0x1837340) left
> > Oct 28 10:52:22 corosync [pcmk]:  ] ERROR: pcmk_wait_dispatch: Child process cib terminated with signal 6 (pid=10646, core=true)
> > Oct 28 10:52:22 corosync [pcmk]:  ] notice: pcmk_wait_dispatch: Respawning failed child process: cib
> > Oct 28 10:52:22 corosync [pcmk]:  ] info: spawn_child: Forked child 10656 for process cib
> > Oct 28 10:52:22 srv2 cib: [10656]: info: Invoked: /usr/lib64/heartbeat/cib
> >
> >
> > Regards,
> > Vlad.
> >
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> 





