[Pacemaker] On recovery of failed node, pengine fails to correctly monitor 'dirty' resources

Mon Aug 11 14:03:48 EDT 2014

Greetings, 

We are running Pacemaker and cman in a two-node cluster with no-quorum-policy: ignore and stonith-enabled: false on a CentOS 6 system (Pacemaker-related RPM versions are listed below). We are seeing some bizarre (to us) behavior when a node is lost completely (e.g. reboot -nf). Here is the scenario:

1) Fail a resource named "some-resource", started with the ocf:heartbeat:anything script (or another agent), on node01. In our case it is a master/slave resource that we are pulling observations from, but this can also happen with ordinary resources.
2) Wait for the resource to recover.
3) Fail node02 (reboot -nf, or power loss).
4) When node02 recovers, we see in /var/log/messages:
  - Quorum is recovered
  - Sending flush op to all hosts for master-some-resource, last-failure-some-resource, probe_complete(true), fail-count-some-resource(1) 
  - pengine Processing failed op monitor for some-resource on node01: unknown error (1)
    * After adding a simple logging line ("`date` called with $@ >> /tmp/log.rsc") to the resource agent, we do not see the agent being called at this time on either node.
    * Sometimes we see other operations, including stop/start, that are also not sent to the RA.
    * The resource is actually running happily on node01 throughout this whole process, so there is no reason for this failure to be reported here.
    * This issue is only seen on resources whose earlier failure had not yet been cleaned up. Resources that were 'clean' when both nodes were last online do not have this issue.
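
The logging check mentioned above could look roughly like the following. This is a minimal sketch, assuming the line is placed near the top of the resource agent script before it dispatches on the requested action. The function name log_ra_call and the LOG_FILE variable are hypothetical; only the /tmp/log.rsc path and the message format come from the description above.

```shell
# Hypothetical helper for tracing resource agent invocations.
# The default path is the one used in the post; LOG_FILE is an assumed
# override so the helper can be pointed elsewhere.
LOG_FILE="${LOG_FILE:-/tmp/log.rsc}"

log_ra_call() {
    # Append a timestamp and the arguments the agent was invoked with.
    echo "$(date) called with $*" >> "$LOG_FILE"
}

# Near the top of the agent, before the action dispatch (assumed placement):
# log_ra_call "$@"
```

If the agent were really being invoked for the reported monitor, a matching line would be expected in the log on the node where it ran.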

We noticed this originally because we are using the ClusterMon RA to report on different types of errors, and this is giving us false positives. Any thoughts on configuration issues we could be having, or does this sound like a bug in Pacemaker somewhere?
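
For reference, the leftover failure record on a 'dirty' resource can be inspected and cleared with the standard Pacemaker command-line tools. The following is a sketch only: the run wrapper, the DRY_RUN variable and the function name are hypothetical, and whether cleaning up before node02 fails avoids the behavior is an assumption based on the observation that 'clean' resources are not affected.

```shell
# Hypothetical dry-run wrapper: prints commands by default, and only
# executes them when DRY_RUN=0 is set on a real cluster node.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

show_and_clear_failures() {
    # $1: resource name, e.g. some-resource
    # One-shot cluster status including per-resource fail counts.
    run crm_mon -1 -f
    # Clear the recorded failure (fail-count / last-failure) for the resource.
    run crm_resource --cleanup --resource "$1"
}

# Example (dry run): show_and_clear_failures some-resource
```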

Thanks!

----
Versions:
ccs-0.16.2-69.el6_5.1.x86_64
clusterlib-3.0.12.1-59.el6_5.2.x86_64
cman-3.0.12.1-59.el6_5.2.x86_64
corosync-1.4.1-17.el6_5.1.x86_64
corosynclib-1.4.1-17.el6_5.1.x86_64
fence-virt-0.2.3-15.el6.x86_64
libqb-0.16.0-2.el6.x86_64
modcluster-0.16.2-28.el6.x86_64
openais-1.1.1-7.el6.x86_64
openaislib-1.1.1-7.el6.x86_64
pacemaker-1.1.10-14.el6_5.3.x86_64
pacemaker-cli-1.1.10-14.el6_5.3.x86_64
pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64
pacemaker-libs-1.1.10-14.el6_5.3.x86_64
pcs-0.9.90-2.el6.centos.3.noarch
resource-agents-3.9.2-40.el6_5.7.x86_64
ricci-0.16.2-69.el6_5.1.x86_64