[ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

Mon Jan 4 09:50:30 EST 2016

So far so bad.
I made a dummy OCF script [0] to simulate an example
promote/demote/notify failure mode for a multistate clone resource which
is very similar to the one I reported originally. And the test to
reproduce my case with the dummy is:
- install dummy resource ocf ra and create the dummy resource as README
[0] says
- just watch the a) OCF logs from the dummy and b) outputs for the
reoccurring commands:

# while true; do date; ls /var/lib/heartbeat/trace_ra/dummy/ | tail -1;
sleep 20; done&
# crm_resource --resource p_dummy --list-operations

At some point I noticed:
- there are no more "OK" messages logged from the monitor actions,
although according to the trace_ra dumps' timestamps, all monitors are
still being invoked!

- at some point I noticed very strange results reported by the:
# crm_resource --resource p_dummy --list-operations
p_dummy (ocf::dummy:dummy):     FAILED : p_dummy_monitor_103000
(node=node-1.test.domain.local, call=579, rc=1, last-rc-change=Mon Jan
4 14:33:07 2016, exec=62107ms): Timed Out
  or
p_dummy (ocf::dummy:dummy):     Started : p_dummy_monitor_103000
(node=node-3.test.domain.local, call=-1, rc=1, last-rc-change=Mon Jan  4
14:43:58 2016, exec=0ms): Timed Out

- according to the trace_ra dumps reoccurring monitors are being invoked
by the intervals *much longer* than configured. For example, a 7 minutes
of "monitoring silence":
Mon Jan  4 14:47:46 UTC 2016
p_dummy.monitor.2016-01-04.14:40:52
Mon Jan  4 14:48:06 UTC 2016
p_dummy.monitor.2016-01-04.14:47:58

Given that said, it is very likely there is some bug exist for
monitoring multi-state clones in pacemaker!

[0] https://github.com/bogdando/dummy-ocf-ra

-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando