[ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

Bogdan Dobrelya bdobrelia at mirantis.com
Thu Feb 4 09:43:29 EST 2016


Hello.
Regarding the original issue, the good news is that resource-agents'
ocf-shellfuncs no longer causes fork bombs in the dummy OCF RA [0]
after the fix [1]. The bad news is that the "self-forking" monitors
issue seems to remain for the rabbitmq OCF RA [2], and I can reproduce
it with another custom agent [3], so I'd guess it may affect other
agents as well.
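To check whether a given agent nests its monitors with lrmd out of the picture, its monitor action can be run by hand under shell tracing. This is a minimal sketch. trace_monitor is a hypothetical helper, not part of Pacemaker or resource-agents. It assumes the agent needs only OCF_ROOT and OCF_RESOURCE_INSTANCE (real agents usually also need their OCF_RESKEY_* parameters exported) and that the agent runs under /bin/sh (use bash -x for bash agents).

```shell
# Hypothetical helper (not part of Pacemaker or resource-agents): run an
# agent's monitor action by hand, outside lrmd, with "sh -x" tracing.
# Assumes the agent needs only OCF_ROOT and OCF_RESOURCE_INSTANCE; real
# agents usually also need their OCF_RESKEY_* parameters exported.
# Usage: trace_monitor /usr/lib/ocf/resource.d/dummy/dummy [trace-file]
trace_monitor() {
    agent=$1
    trace=${2:-/tmp/monitor-trace.log}
    # "sh -x" writes its trace to stderr, so stderr goes to the trace file.
    OCF_ROOT=${OCF_ROOT:-/usr/lib/ocf} \
    OCF_RESOURCE_INSTANCE=${OCF_RESOURCE_INSTANCE:-debug-instance} \
        sh -x "$agent" monitor 2>"$trace"
    rc=$?
    echo "monitor rc=$rc, trace lines: $(wc -l <"$trace")"
    # Pass the agent's OCF return code through to the caller.
    return "$rc"
}
```

If nested monitors still show up in ps while this runs, the agent (or something it sources) is the more likely culprit than lrmd. A clean run would point back at how lrmd spawns the action.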

IIUC, the issue seems related to how lrmd forks monitor actions.
I tried to debug both pacemaker 1.1.10 and 1.1.12 with gdb as follows:

# cat ./cmds
set follow-fork-mode child
set detach-on-fork off
set follow-exec-mode new
catch fork
catch vfork
cont
# gdb -x cmds /usr/lib/pacemaker/lrmd `pgrep lrmd`

I can confirm that it catches the forked monitors, and that those make
nested forks as well. But *many* debug symbols are missing, the bt is
full of question marks and, honestly, I'm not a gdb guru and don't know
what to check for in the reproduced cases.
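Without debug symbols the backtraces say little, but the nesting itself can be measured from the process table instead. This is a minimal sketch that assumes a ps supporting the POSIX "-o pid=,ppid=,args=" format. monitor_nesting is a hypothetical helper, not part of Pacemaker or resource-agents. For every process whose command line contains the given string, it prints how many of that process's ancestors also match, so a depth above 0 means a monitor was forked from another monitor.

```shell
# Hypothetical helper (not part of Pacemaker or resource-agents).
# Reads "PID PPID ARGS..." lines, e.g. from: ps -eo pid=,ppid=,args=
# For every process whose ARGS contain the given string, prints its PID
# and how many of its ancestors also match (0 = not nested).
monitor_nesting() {
    awk -v pat="$1" '
        {
            pid = $1
            parent[pid] = $2
            # Rebuild ARGS from fields 3..NF (runs of spaces become one).
            args = $3
            for (i = 4; i <= NF; i++) args = args " " $i
            # index() does a plain substring match, not a regex match.
            if (index(args, pat) > 0) matched[pid] = 1
        }
        END {
            for (pid in matched) {
                depth = 0
                p = parent[pid]
                # Walk up the ancestry until a PID with no known parent.
                while (p in parent) {
                    if (p in matched) depth++
                    if (parent[p] == p) break   # guard: self-parented PID 0
                    p = parent[p]
                }
                print pid, depth
            }
        }' | sort -n
}
```

For example, "ps -eo pid=,ppid=,args= | monitor_nesting 'dummy monitor'" on a tree like the one quoted further down (PIDs 31815, 31908, 31910, 31915 under lrmd) would report depths 0, 1, 2 and 3. A subshell forked by the script itself (e.g. for a command substitution) shows the same command line in ps, so a nonzero depth alone does not prove that the agent re-executes itself.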

So any help with troubleshooting this further is very much appreciated!

[0] https://github.com/bogdando/dummy-ocf-ra
[1] https://github.com/ClusterLabs/resource-agents/issues/734
[2] https://github.com/rabbitmq/rabbitmq-server/blob/master/scripts/rabbitmq-server-ha.ocf
[3] https://git.openstack.org/cgit/openstack/fuel-library/tree/files/fuel-ha-utils/ocf/ns_vrouter

On 04.01.2016 17:33, Bogdan Dobrelya wrote:
> On 04.01.2016 17:14, Dejan Muhamedagic wrote:
>> Hi,
>>
>> On Mon, Jan 04, 2016 at 04:52:43PM +0100, Bogdan Dobrelya wrote:
>>> On 04.01.2016 16:36, Ken Gaillot wrote:
>>>> On 01/04/2016 09:25 AM, Bogdan Dobrelya wrote:
>>>>> On 04.01.2016 15:50, Bogdan Dobrelya wrote:
>> [...]
>>>>> Also note that lrmd spawns *many* monitors, like:
>>>>> root      6495  0.0  0.0  70268  1456 ?        Ss    2015   4:56  \_ /usr/lib/pacemaker/lrmd
>>>>> root     31815  0.0  0.0   4440   780 ?        S    15:08   0:00  |   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>>>>> root     31908  0.0  0.0   4440   388 ?        S    15:08   0:00  |       \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>>>>> root     31910  0.0  0.0   4440   384 ?        S    15:08   0:00  |           \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>>>>> root     31915  0.0  0.0   4440   392 ?        S    15:08   0:00  |               \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>>>>> ...
>>>>
>>>> At first glance, that looks like your monitor action is calling itself
>>>> recursively, but I don't see how in your code.
>>>
>>> Yes, it seems to be a bug in ocf_log() from ocf-shellfuncs.
>>
>> If you're sure about that, please open an issue at
>> https://github.com/ClusterLabs/resource-agents/issues
> 
> Submitted [0]. Thank you!
> Note that the issue seems to be caused by the import (sourcing) of
> ocf-shellfuncs itself, not by the ocf_run or ocf_log code.
> 
> [0] https://github.com/ClusterLabs/resource-agents/issues/734
> 
>>
>> Thanks,
>>
>> Dejan


-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando
