[Pacemaker] Nodes Switch To "pending" State

Tue Feb 19 00:58:16 EST 2013

This looks like the underlying problem:

Feb 10 23:58:07 [1199] vcsquorum        cib:   notice: cib:diff: -- <node uname="vcsquorum.example.com" id="755053578" />
Feb 10 23:58:07 [1199] vcsquorum        cib:   notice: cib:diff: ++ <node id="755053578" uname="vcsquorum" />

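Something is confused about what the node(s) should be called: the same node id (755053578) is being removed under its fully qualified name (vcsquorum.example.com) and re-added under its short name (vcsquorum).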
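
If it helps to check the other nodes for the same flip-flop, here is a rough sketch that groups <node> entries by id and reports any id that shows up under more than one uname. The function name find_name_conflicts is made up for illustration, and the two fragments are the ones from the log above; you would have to pull the <node> elements out of your own logs or CIB dumps yourself.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def find_name_conflicts(node_xml_fragments):
    """Return {node id: sorted unames} for ids seen with more than one uname.

    Hypothetical helper for illustration; not part of Pacemaker.
    """
    names_by_id = defaultdict(set)
    for fragment in node_xml_fragments:
        node = ET.fromstring(fragment)
        names_by_id[node.get("id")].add(node.get("uname"))
    return {
        node_id: sorted(names)
        for node_id, names in names_by_id.items()
        if len(names) > 1
    }


# The two <node> entries from the cib:diff lines quoted above.
removed = '<node uname="vcsquorum.example.com" id="755053578" />'
added = '<node id="755053578" uname="vcsquorum" />'

conflicts = find_name_conflicts([removed, added])
print(conflicts)
```

Any id it reports is a node that different parts of the stack are naming differently.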

On Mon, Feb 11, 2013 at 6:48 PM, Andrew Martin <amartin at xes-inc.com> wrote:
> Hello,
>
> I am running a 3-node Pacemaker (1.1.8) + Corosync (2.1.0) cluster on Ubuntu 12.04. Two of the nodes are "real" nodes, hosting a DRBD filesystem mount and some daemons:
> http://pastebin.com/n1sNMhuE
> The third node cannot run resources and acts as a quorum node in standby.
>
> Recently, all of the nodes have been changing to the "pending" state, and they may remain there for quite some time (many days) before coming back online (if ever). Using "crm node clearstate" does not help.
>
> Tonight I stopped pacemaker and corosync on all nodes and emptied the contents of /var/lib/pacemaker/cib, /var/lib/pacemaker/pengine, and /var/lib/corosync. After doing so, I restarted corosync and pacemaker on all of the nodes, and repopulated the CIB once the nodes had all joined. This restored the nodes' states to "online"; however, after a few minutes the nodes all went back into "pending", this time for only around 5 minutes. Here's the log from the current DC:
> http://pastebin.com/xhfsb15d
>
> There do not appear to be any faults in the corosync rings:
> RING ID 0
>         id      = 192.168.1.170
>         status  = ring 0 active with no faults
> RING ID 1
>         id      = 192.168.7.170
>         status  = ring 1 active with no faults
>
> corosync.conf:
> http://pastebin.com/DQUNdp9f
>
> Some common messages I am seeing in the log:
> Peer is not part of our cluster
> Diff 2.106.7 -> 2.106.8 from vcs1 not applied to 2.105.12: current "epoch" is less than required (epoch, admin_epoch, and num_updates all appear in this message)
> What do these messages mean? Do they indicate a problem?
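
On the "Diff ... not applied" message: the three numbers are the CIB version, in the order admin_epoch.epoch.num_updates. The diff was generated against 2.106.7, but the local copy is at 2.105.12, so the local epoch (105) is behind the one the diff needs (106) and the diff is rejected. A simplified model of that comparison is sketched below; check_diff and parse_version are made-up names and this is an assumption about the logic, not Pacemaker's actual code.

```python
FIELDS = ("admin_epoch", "epoch", "num_updates")


def parse_version(text):
    """Split 'admin_epoch.epoch.num_updates' into a tuple of ints."""
    admin_epoch, epoch, num_updates = (int(part) for part in text.split("."))
    return (admin_epoch, epoch, num_updates)


def check_diff(current, diff_source):
    """Compare the local CIB version with the version a diff starts from.

    Simplified illustration only: the first field that differs decides
    the outcome, most significant field first.
    """
    cur = parse_version(current)
    src = parse_version(diff_source)
    for name, have, need in zip(FIELDS, cur, src):
        if have < need:
            return f'current "{name}" is less than required'
        if have > need:
            return f'current "{name}" is greater than required'
    return "ok"


# Values taken from the log message quoted above.
result = check_diff("2.105.12", "2.106.7")
print(result)
```

With the values from your log, the first field that differs is epoch, which matches the wording of the message.
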
>
> Do you have any ideas on what may be causing this behavior?
>
> Thanks,
>
> Andrew