[ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact
bubble at hoster-ok.com
Thu Dec 31 08:48:42 EST 2015
31.12.2015 12:57:45 CET, Bogdan Dobrelya <bdobrelia at mirantis.com> wrote:
>I've been hopelessly fighting a bug in the custom OCF agent of Fuel
>for OpenStack project. It is related to the destructive test case where
>one node out of 3 or 5 total goes down and then comes back. The bug
>itself is tricky (it is rarely reproduced), too long to describe here,
>and has many duplicates, so I put here the latest comment.
>As it says,
>at some point, after the rabbit OCF monitor reported an error followed
>by several "not running" reports (see the crmd log snippet), pacemaker
>starts "thinking" everything is fine with the resource and shows it as
>"running", while in fact it is completely dead, and a manually
>triggered OCF monitor action confirms that (not running). But *why*
>does pacemaker show the resource as running and never call monitor
>actions again?
>I have no idea how to proceed with the root cause of such pacemaker
>behavior, so I'm asking for any recommendations on how to debug and
>troubleshoot this strange situation, and for which useful log patterns
>to look for (and where).
>Thank you in advance!
>PS. This is Pacemaker 1.1.12, Corosync 2.3.4, libqb0 0.17.0 from
>vivid. But the Corosync & Pacemaker cluster looks healthy and I can
>see no log records saying otherwise.
First, could you paste your CIB, preferably not in XML but in crmsh format? Just to check that everything is fine with the resource and fencing configuration.
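For reference, a crmsh-format dump of the live CIB can be obtained with `crm configure show` (assuming the crmsh shell is installed); `cibadmin` produces the raw XML:

```shell
# Dump the current cluster configuration in crmsh syntax
crm configure show

# For comparison, the raw CIB XML (what pacemaker stores internally)
cibadmin --query > cib.xml
```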
Then, you may enable blackbox tracing inside pacemaker (USR1, USR2 and TRAP signals, IIRC); a quick Google search should point you to Andrew's blog with all the information about that feature.
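A rough sketch of how that looks in practice, assuming the signal semantics are as I remember them (enable recording with USR1, dump with TRAP) -- do verify against Andrew's blog before relying on this:

```shell
# Illustrative sketch: enable the libqb blackbox in crmd, then dump it.
CRMD_PID=$(pidof crmd)

kill -USR1 "$CRMD_PID"   # start recording trace data to the in-memory blackbox
# ... reproduce the problem here ...
kill -TRAP "$CRMD_PID"   # write the blackbox contents to disk

# Dumps typically land under /var/lib/pacemaker/blackbox/ and are decoded with
# the qb-blackbox utility shipped with libqb:
qb-blackbox /var/lib/pacemaker/blackbox/crmd-*
```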
Next, if you use ocf-shellfuncs in your RA, you could enable tracing for the resource itself; just add 'trace_ra=1' to every operation's config (start and monitor).
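In crmsh syntax that would look roughly like the fragment below (the resource and agent names here are made up for illustration; substitute your own):

```shell
# Hypothetical primitive with per-operation RA tracing enabled via trace_ra=1
primitive p_rabbitmq-server ocf:fuel:rabbitmq-server \
    op start timeout=180 trace_ra=1 \
    op monitor interval=30 timeout=60 trace_ra=1
```

With tracing on, ocf-shellfuncs writes a shell trace of each traced operation under the resource agent trace directory, which is usually the quickest way to see exactly what the monitor action did and returned.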
All that may give you some additional hints on what's going on.
Also, you may think about upgrading pacemaker to 1.1.14-rcX, together with libqb 0.17.2 (and rebuilding corosync against that libqb).