[Pacemaker] Nodes Switch To "pending" State

Mon Feb 11 02:48:40 EST 2013

Hello,

I am running a 3-node Pacemaker (1.1.8) + Corosync (2.1.0) cluster on Ubuntu 12.04. Two of the nodes are "real" nodes, hosting a DRBD filesystem mount and some daemons:
http://pastebin.com/n1sNMhuE
The third node cannot run resources and acts as a quorum node in standby.

Recently, all of the nodes have been switching to the "pending" state, and they may remain there for a long time (many days) before coming back online, if they ever do. Running "crm node clearstate" does not help.

Tonight I stopped Pacemaker and Corosync on all nodes and emptied the contents of /var/lib/pacemaker/cib, /var/lib/pacemaker/pengine, and /var/lib/corosync. I then restarted Corosync and Pacemaker on all of the nodes and repopulated the CIB once they had all joined. This restored the nodes' states to "online"; however, after a few minutes all of the nodes went back into "pending", this time for only around 5 minutes. Here is the log from the current DC:
http://pastebin.com/xhfsb15d

There do not appear to be any faults in the corosync rings:
RING ID 0
	id	= 192.168.1.170
	status	= ring 0 active with no faults
RING ID 1
	id	= 192.168.7.170
	status	= ring 1 active with no faults

corosync.conf:
http://pastebin.com/DQUNdp9f

Some common messages I am seeing in the log:
Peer is not part of our cluster
Diff 2.106.7 -> 2.106.8 from vcs1 not applied to 2.105.12: current "epoch" is less than required (variants of this message naming "epoch", "admin_epoch", and "num_updates" all appear)
What do these messages mean? Do they indicate a problem?
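
To make the second message concrete, here is a small Python sketch of how I am currently reading it. It assumes the dotted numbers are admin_epoch.epoch.num_updates triples that are compared left to right; that is my understanding of CIB versioning, not something the log itself confirms. The names FIELDS, DIFF_RE, parse_version, and explain_diff are my own and are not part of Pacemaker.

```python
import re

# Assumption: a CIB version is "admin_epoch.epoch.num_updates".
FIELDS = ("admin_epoch", "epoch", "num_updates")

# Matches the "Diff A -> B from NODE not applied to C" part of the log line.
DIFF_RE = re.compile(
    r"Diff (\d+\.\d+\.\d+) -> (\d+\.\d+\.\d+) from (\S+) "
    r"not applied to (\d+\.\d+\.\d+)"
)


def parse_version(text):
    """Turn '2.106.7' into the tuple (2, 106, 7)."""
    return tuple(int(part) for part in text.split("."))


def explain_diff(line):
    """Return a summary of a 'Diff ... not applied' line, or None if no match."""
    match = DIFF_RE.search(line)
    if match is None:
        return None
    base, target, sender, current = match.groups()
    base_v = parse_version(base)
    current_v = parse_version(current)

    # Find the first field where the local version differs from the version
    # the diff expects to start from; report it only if the local copy is lower.
    field = None
    for name, have, need in zip(FIELDS, current_v, base_v):
        if have != need:
            if have < need:
                field = name
            break

    return {
        "sender": sender,
        "base": base_v,
        "target": parse_version(target),
        "current": current_v,
        "behind": current_v < base_v,  # tuple comparison, left to right
        "field": field,
    }


if __name__ == "__main__":
    msg = ('Diff 2.106.7 -> 2.106.8 from vcs1 not applied to 2.105.12: '
           'current "epoch" is less than required')
    print(explain_diff(msg))
```

Under that reading, the local copy (2.105.12) is one epoch behind the version the diff from vcs1 expects to start from (2.106.7), which would match the "epoch" named in the message. I do not know whether this is the cause of the "pending" state or just a symptom of it.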

Do you have any ideas on what may be causing this behavior?

Thanks,

Andrew