[ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

Mon Jan 4 05:34:08 EST 2016

On 01.01.2016 11:34, Vladislav Bogdanov wrote:
> 31.12.2015 15:33:45 CET, Bogdan Dobrelya <bdobrelia at mirantis.com> wrote:
>> On 31.12.2015 14:48, Vladislav Bogdanov wrote:
>>> blackbox tracing inside pacemaker, USR1, USR2 and TRAP signals iirc,
>>> quick google search should point you to Andrew's blog with all
>>> information about that feature.
>>> Next, if you use ocf-shellfuncs in your RA, you could enable tracing
>>> for resource itself, just add 'trace_ra=1' to every operation config
>>> (start and monitor).
>>
>> Thank you, I will try to play with these things once I have the issue
>> reproduced again. Cannot provide CIB as I don't have the env now.
>>
>> But let me ask again: does anyone know of, or has anyone heard of,
>> known/fixed bugs where corosync with pacemaker stops running monitor
>> actions for a resource at some point, while notifications are still
>> logged?
>>
>> Here is an example:
>> node-16 crmd:
>> 2015-12-29T13:16:49.113679+00:00 notice:    notice: process_lrm_event:
>> Operation p_rabbitmq-server_monitor_27000: unknown error
>> (node=node-16.test.domain.local, call=254, rc=1, cib-update=1454,
>> confirmed=false)
>> node-17:
>> 2015-12-29T13:16:57.603834+00:00 notice:    notice: process_lrm_event:
>> Operation p_rabbitmq-server_monitor_103000: unknown error
>> (node=node-17.test.domain.local, call=181, rc=1, cib-update=297,
>> confirmed=false)
>> node-18:
>> 2015-12-29T13:20:16.870619+00:00 notice:    notice: process_lrm_event:
>> Operation p_rabbitmq-server_monitor_103000: not running
>> (node=node-18.test.domain.local, call=187, rc=7, cib-update=306,
>> confirmed=false)
>> node-20:
>> 2015-12-29T13:20:51.486219+00:00 notice:    notice: process_lrm_event:
>> Operation p_rabbitmq-server_monitor_30000: not running
>> (node=node-20.test.domain.local, call=180, rc=7, cib-update=308,
>> confirmed=false)
>>
>> After that point, only notifications were logged for the affected
>> nodes, like:
>> Operation p_rabbitmq-server_notify_0: ok
>> (node=node-20.test.domain.local, call=287, rc=0, cib-update=0,
>> confirmed=true)
>>
>> Meanwhile, node-19 was not affected, and its monitor/stop/start/notify
>> actions were logged the whole time, like:
>> 2015-12-29T14:30:00.973561+00:00 notice:    notice: process_lrm_event:
>> Operation p_rabbitmq-server_monitor_30000: not running
>> (node=node-19.test.domain.local, call=423, rc=7, cib-update=438,
>> confirmed=false)
>> 2015-12-29T14:30:01.631609+00:00 notice:    notice: process_lrm_event:
>> Operation p_rabbitmq-server_notify_0: ok
>> (node=node-19.test.domain.local, call=424, rc=0, cib-update=0,
>> confirmed=true)
>> 2015-12-29T14:31:19.084165+00:00 notice:    notice: process_lrm_event:
>> Operation p_rabbitmq-server_stop_0: ok (node=node-19.test.domain.local,
>> call=427, rc=0, cib-update=439, confirmed=true)
>> 2015-12-29T14:32:53.120157+00:00 notice:    notice: process_lrm_event:
>> Operation p_rabbitmq-server_start_0: unknown error
>> (node=node-19.test.domain.local, call=428, rc=1, cib-update=441,
>> confirmed=true)
> 
> Well, not running and not logged is not the same thing. I do not have access to code right now, but I'm pretty sure that successful recurring monitors are not logged after the first run. trace_ra for monitor op should prove that. If not, then it should be a bug. I recall something was fixed in that area recently.
> 

Is it http://bugs.clusterlabs.org/show_bug.cgi?id=5072 /
http://bugs.clusterlabs.org/show_bug.cgi?id=5063 ? I found nothing more
recent in the pacemaker commits and issues. These are not *exactly* my
case, though several promote and demote actions did take place due to
the test.

Btw, as I understand from the comments on bugs 5072/5063, the issue
remains unfixed for some of the reported cases, am I right?
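
As a side note, the process_lrm_event excerpts quoted above can be
tallied per node and operation, which makes it easier to see where the
last *logged* monitor result came from. Below is a minimal Python sketch
of that idea, not a definitive tool. It assumes the log text looks
exactly like the excerpts in this thread (other Pacemaker versions may
log differently) and that records are wrapped only at whitespace, not in
the middle of a word. The names LRM_EVENT, last_events and SAMPLE are
made up for illustration.

```python
import re

# Pattern for the crmd "process_lrm_event" summary text as it appears in the
# excerpts quoted in this thread. This is an assumption about the log format,
# not a documented Pacemaker interface.
LRM_EVENT = re.compile(
    r"Operation (?P<rsc>\S+?)_(?P<op>[a-z]+)_(?P<interval>\d+): "
    r"(?P<status>[^(]+?)\s*"
    r"\(node=(?P<node>[^,]+), call=(?P<call>\d+), rc=(?P<rc>\d+), "
    r"cib-update=(?P<cib>\d+), confirmed=(?P<confirmed>true|false)\)"
)


def last_events(log_text):
    """Return {(node, op): (call, rc, status)}, keeping the highest call id."""
    # Drop mail quote markers ("> ", ">> ") and re-join wrapped records
    # by collapsing all whitespace into single spaces.
    lines = (line.lstrip("> ") for line in log_text.splitlines())
    flat = " ".join(" ".join(lines).split())

    latest = {}
    for m in LRM_EVENT.finditer(flat):
        key = (m["node"], m["op"])
        call = int(m["call"])
        if key not in latest or call > latest[key][0]:
            latest[key] = (call, int(m["rc"]), m["status"])
    return latest


# Two records copied from the node-19 excerpt in this thread.
SAMPLE = """\
>> 2015-12-29T14:30:00.973561+00:00 notice:    notice: process_lrm_event:
>> Operation p_rabbitmq-server_monitor_30000: not running
>> (node=node-19.test.domain.local, call=423, rc=7, cib-update=438,
>> confirmed=false)
>> 2015-12-29T14:30:01.631609+00:00 notice:    notice: process_lrm_event:
>> Operation p_rabbitmq-server_notify_0: ok
>> (node=node-19.test.domain.local, call=424, rc=0, cib-update=0,
>> confirmed=true)
"""

for (node, op), (call, rc, status) in sorted(last_events(SAMPLE).items()):
    print(f"{node} {op}: call={call} rc={rc} ({status})")
```

Note that this only shows what was logged. If successful recurring
monitors are indeed not logged after the first run, as suggested above,
a missing monitor entry here does not by itself prove the monitor
stopped running, so the trace_ra output would still be needed to tell
the two cases apart.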

> Best,
> Vladislav
> 
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 

-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando