[Pacemaker] Nodes Switch To "pending" State
Andrew Beekhof
andrew at beekhof.net
Tue Feb 19 05:58:16 UTC 2013
This looks like the underlying problem:
Feb 10 23:58:07 [1199] vcsquorum cib: notice: cib:diff: --
<node uname="vcsquorum.example.com" id="755053578" />
Feb 10 23:58:07 [1199] vcsquorum cib: notice: cib:diff: ++
<node id="755053578" uname="vcsquorum" />
Something is confused about what the node(s) should be called: node id 755053578 is removed as "vcsquorum.example.com" and re-added as "vcsquorum".
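A quick way to check a longer log for the same symptom is to collect every <node .../> element and group the names by id. The following is only a rough sketch: the function name conflicting_node_names and the regular expressions are illustrative (not part of Pacemaker), and it assumes the elements appear in the log text in the same form as the two lines above.

```python
import re
from collections import defaultdict

# Matches <node ... /> elements only (not <nodes> or <node_state>).
NODE_RE = re.compile(r"<node\s+([^>]*?)/?>")
# Pulls name="value" pairs out of the element, in any order.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def conflicting_node_names(log_text):
    """Return {node_id: sorted unames} for ids seen with more than one uname."""
    names = defaultdict(set)
    for match in NODE_RE.finditer(log_text):
        attrs = dict(ATTR_RE.findall(match.group(1)))
        if "id" in attrs and "uname" in attrs:
            names[attrs["id"]].add(attrs["uname"])
    return {nid: sorted(unames) for nid, unames in names.items() if len(unames) > 1}


# The two diff lines quoted above, joined the way they would appear in the log.
sample = '''
Feb 10 23:58:07 [1199] vcsquorum cib: notice: cib:diff: -- <node uname="vcsquorum.example.com" id="755053578" />
Feb 10 23:58:07 [1199] vcsquorum cib: notice: cib:diff: ++ <node id="755053578" uname="vcsquorum" />
'''

print(conflicting_node_names(sample))
```

Any id reported here is being registered under more than one name, which is the situation shown in the diff.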
On Mon, Feb 11, 2013 at 6:48 PM, Andrew Martin <amartin at xes-inc.com> wrote:
> Hello,
>
> I am running a 3-node Pacemaker (1.1.8) + Corosync (2.1.0) cluster on Ubuntu 12.04. Two of the nodes are "real" nodes, hosting a DRBD filesystem mount and some daemons:
> http://pastebin.com/n1sNMhuE
> The third node cannot run resources and acts as a quorum node in standby.
>
> Recently, all of the nodes have started changing to the "pending" state, and may remain there for quite some time (many days) before coming back online (if ever). Using "crm node clearstate" does not help.
>
> Tonight I stopped pacemaker and corosync on all nodes and emptied the contents of /var/lib/pacemaker/cib, /var/lib/pacemaker/pengine, and /var/lib/corosync. I then restarted corosync and pacemaker on all of the nodes and repopulated the CIB once they had all joined. This restored the node states to "online"; however, after a few minutes the nodes all went back to "pending", this time for only around 5 minutes. Here's the log from the current DC:
> http://pastebin.com/xhfsb15d
>
> There do not appear to be any faults in the corosync rings:
> RING ID 0
>     id     = 192.168.1.170
>     status = ring 0 active with no faults
> RING ID 1
>     id     = 192.168.7.170
>     status = ring 1 active with no faults
>
> corosync.conf:
> http://pastebin.com/DQUNdp9f
>
> Some common messages I am seeing in the log:
> Peer is not part of our cluster
> Diff 2.106.7 -> 2.106.8 from vcs1 not applied to 2.105.12: current "epoch" is less than required (epoch, admin_epoch, and num_updates all appear in this message)
> What do these messages mean? Do they indicate a problem?
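The "Diff ... not applied" message can be read as a comparison of CIB version triples. The sketch below assumes the three dotted numbers are admin_epoch.epoch.num_updates (the usual Pacemaker ordering) and that the message always has the shape quoted above. The names DIFF_RE, parse_version and explain are made up for illustration. It reports the first field in which the local CIB differs from the version the diff was made against.

```python
import re

# Assumed message shape, taken from the log line quoted above.
DIFF_RE = re.compile(
    r"Diff (\d+\.\d+\.\d+) -> (\d+\.\d+\.\d+) from (\S+) not applied to (\d+\.\d+\.\d+)"
)

# Assumed field order of a CIB version triple.
FIELDS = ("admin_epoch", "epoch", "num_updates")


def parse_version(text):
    """Turn '2.106.7' into (2, 106, 7) so fields compare numerically."""
    return tuple(int(part) for part in text.split("."))


def explain(line):
    """Describe why a diff was rejected, or return None if the line does not match."""
    match = DIFF_RE.search(line)
    if match is None:
        return None
    source = parse_version(match.group(1))  # version the diff starts from
    sender = match.group(3)
    local = parse_version(match.group(4))   # version this node currently holds
    for name, have, need in zip(FIELDS, local, source):
        if have != need:
            relation = "behind" if have < need else "ahead of"
            return f"local CIB {match.group(4)} is {relation} {sender} in {name} ({have} vs {need})"
    return "versions match"


line = ('Diff 2.106.7 -> 2.106.8 from vcs1 not applied to 2.105.12: '
        'current "epoch" is less than required')
print(explain(line))
```

For the quoted line, the local copy is at epoch 105 while the diff from vcs1 starts at epoch 106, so this node has missed at least one configuration change and cannot apply the incremental update.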
>
> Do you have any ideas on what may be causing this behavior?
>
> Thanks,
>
> Andrew