[ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while in fact it is not
bdobrelia at mirantis.com
Thu Dec 31 06:57:45 EST 2015
I've been hopelessly fighting a bug in the custom OCF agent of the Fuel
for OpenStack project. It is related to a destructive test case in which
one node out of 3 or 5 goes down and then comes back. The bug itself is
tricky (rarely reproduced), long to describe, and has many duplicates,
so I will only quote the latest comment here.
As it says, at some point, after the rabbit OCF monitor reported an
error followed by several "not running" reports (see the crmd log
snippet), Pacemaker starts "thinking" everything is fine with the
resource and shows it as "running", while in fact it is completely dead,
and a manually triggered OCF monitor action confirms that (not running).
But *why* does Pacemaker show the resource as running and never call
the monitor action again?
I have no idea how to find the root cause of this Pacemaker behavior.
So I'm asking for guidance: any recommendations on how to debug and
troubleshoot this strange situation, and which useful log patterns to
look for (and where).
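For context, here is roughly what I have been running on a cluster node to compare reality with what Pacemaker reports. As far as I understand, Pacemaker derives its idea of resource state from the last recorded operation results in the CIB status section, so dumping those and forcing a reprobe seemed like reasonable steps. This is only a sketch; the resource id `p_rabbitmq-server` below is a placeholder for the actual multi-state clone, and the exact option spellings should be checked against the crm_resource/crm_mon man pages for 1.1.12:

```shell
#!/bin/sh
# Diagnostic sketch for a Pacemaker 1.1.x cluster node.
# RSC is a placeholder; substitute the real multi-state resource id.
RSC=p_rabbitmq-server

# One-shot cluster status, including inactive resources and fail counts
crm_mon -1 -rf

# Pacemaker's view of resource state comes from recorded operation
# results (lrm_rsc_op entries) in the CIB status section; dump them
cibadmin --query | grep "lrm_rsc_op" | grep "${RSC}"

# Run the OCF monitor action directly, bypassing the cluster, to see
# whether the agent itself still reports "not running"
crm_resource --force-check --resource "${RSC}"

# Ask Pacemaker to re-probe resource state on all nodes
crm_resource --reprobe

# Clear the resource's operation history and fail counts, which also
# triggers fresh probes
crm_resource --cleanup --resource "${RSC}"
```

These commands require a live Corosync/Pacemaker cluster, so I can only run them on the affected nodes.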
Thank you in advance!
PS. This is Pacemaker 1.1.12, Corosync 2.3.4, libqb0 0.17.0 from Ubuntu
vivid. The Corosync & Pacemaker cluster itself looks healthy, and I can
find no log records saying otherwise.