[ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

Thu Dec 31 09:33:45 EST 2015

On 31.12.2015 14:48, Vladislav Bogdanov wrote:
> blackbox tracing inside pacemaker, USR1, USR2 and TRAP signals iirc, quick google search should point you to Andrew's blog with all information about that feature.
> Next, if you use ocf-shellfuncs in your RA, you could enable tracing for resource itself, just add 'trace_ra=1' to every operation config (start and monitor).

Thank you, I will try these things once I have reproduced the issue
again. I cannot provide the CIB, as I no longer have the environment.

But let me ask again: does anyone know of any known or fixed bugs where
Pacemaker with Corosync stops running monitor actions for a resource at
some point, while notifications are still logged?

Here is an example:
node-16 crmd:
2015-12-29T13:16:49.113679+00:00 notice:    notice: process_lrm_event:
Operation p_rabbitmq-server_monitor_27000: unknown error
(node=node-16.test.domain.local, call=254, rc=1, cib-update=1454,
confirmed=false)
node-17:
2015-12-29T13:16:57.603834+00:00 notice:    notice: process_lrm_event:
Operation p_rabbitmq-server_monitor_103000: unknown error
(node=node-17.test.domain.local, call=181, rc=1, cib-update=297,
confirmed=false)
node-18:
2015-12-29T13:20:16.870619+00:00 notice:    notice: process_lrm_event:
Operation p_rabbitmq-server_monitor_103000: not running
(node=node-18.test.domain.local, call=187, rc=7, cib-update=306,
confirmed=false)
node-20:
2015-12-29T13:20:51.486219+00:00 notice:    notice: process_lrm_event:
Operation p_rabbitmq-server_monitor_30000: not running
(node=node-20.test.domain.local, call=180, rc=7, cib-update=308,
confirmed=false)

After that point, only notifications were logged for the affected nodes,
for example:
Operation p_rabbitmq-server_notify_0: ok
(node=node-20.test.domain.local, call=287, rc=0, cib-update=0,
confirmed=true)

Meanwhile, node-19 was not affected: its monitor/stop/start/notify
actions kept being logged the whole time, for example:
2015-12-29T14:30:00.973561+00:00 notice:    notice: process_lrm_event:
Operation p_rabbitmq-server_monitor_30000: not running
(node=node-19.test.domain.local, call=423, rc=7, cib-update=438,
confirmed=false)
2015-12-29T14:30:01.631609+00:00 notice:    notice: process_lrm_event:
Operation p_rabbitmq-server_notify_0: ok
(node=node-19.test.domain.local, call=424, rc=0, cib-update=0,
confirmed=true)
2015-12-29T14:31:19.084165+00:00 notice:    notice: process_lrm_event:
Operation p_rabbitmq-server_stop_0: ok (node=node-19.test.domain.local,
call=427, rc=0, cib-update=439, confirmed=true)
2015-12-29T14:32:53.120157+00:00 notice:    notice: process_lrm_event:
Operation p_rabbitmq-server_start_0: unknown error
(node=node-19.test.domain.local, call=428, rc=1, cib-update=441,
confirmed=true)
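
For anyone who wants to check their own logs for the same pattern, here
is a rough sketch of how the process_lrm_event lines above could be
summarized per node. It is only an illustration, not a tool that ships
with Pacemaker: the names EVENT_RE, SAMPLE and last_call_per_op are
made up for this example, and the regular expression assumes the crmd
log format exactly as pasted in this message (including entries wrapped
across several lines).

```python
import re
from collections import defaultdict

# Assumption: events look like the crmd lines quoted in this message, e.g.
#   Operation p_rabbitmq-server_monitor_30000: not running
#   (node=node-20.test.domain.local, call=180, rc=7, ...)
# Whitespace (including line breaks) between the parts is tolerated.
EVENT_RE = re.compile(
    r"Operation\s+(?P<rsc>\S+)_(?P<op>[a-z]+)_(?P<interval>\d+):\s*"
    r"(?P<status>[^(]+?)\s*"
    r"\(node=(?P<node>[^,\s]+),\s*call=(?P<call>\d+),\s*rc=(?P<rc>\d+)"
)


def last_call_per_op(log_text):
    """Return {node: {operation: highest call ID seen}} for the given log text."""
    summary = defaultdict(dict)
    for match in EVENT_RE.finditer(log_text):
        node = match["node"]
        op = match["op"]
        call = int(match["call"])
        # Keep the highest call ID seen for this operation on this node.
        summary[node][op] = max(call, summary[node].get(op, 0))
    return {node: dict(ops) for node, ops in summary.items()}


# Two of the node-20 entries from this message, used as sample input.
SAMPLE = """\
2015-12-29T13:20:51.486219+00:00 notice:    notice: process_lrm_event:
Operation p_rabbitmq-server_monitor_30000: not running
(node=node-20.test.domain.local, call=180, rc=7, cib-update=308,
confirmed=false)
Operation p_rabbitmq-server_notify_0: ok
(node=node-20.test.domain.local, call=287, rc=0, cib-update=0,
confirmed=true)
"""

summary = last_call_per_op(SAMPLE)
for node, ops in sorted(summary.items()):
    print(node, ops)
```

On a node where the newest notify call ID is far above the newest
monitor call ID (287 versus 180 for node-20 here), the log would be
consistent with monitors no longer being run while notifications
continue. That is only a hint for where to look, not proof of a bug.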

-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando