[ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

Fri Feb 5 07:11:27 EST 2016

On 04.02.2016 15:43, Bogdan Dobrelya wrote:
> Hello.
> Regarding the original issue, the good news is that the resource-agents
> ocf-shellfuncs no longer causes fork bombs for the dummy OCF RA [0]
> after the fix [1]. The bad news is that the "self-forking" monitors
> issue seems to remain for the rabbitmq OCF RA [2], and I can reproduce
> it for another custom agent [3], so I'd guess it may be valid for
> other agents as well.
> 
> IIUC, the issue seems related to how lrmd forks monitor actions.
> I tried to debug both pacemaker 1.1.10 and 1.1.12 with gdb as follows:
> 
> # cat ./cmds
> set follow-fork-mode child
> set detach-on-fork off
> set follow-exec-mode new
> catch fork
> catch vfork
> cont
> # gdb -x cmds /usr/lib/pacemaker/lrmd `pgrep lrmd`
> 
> I can confirm it catches forked monitors, and that they make nested forks
> as well. But I have *many* debug symbols missing, bt is full of question
> marks and, honestly, I'm not a gdb guru and do not know what to check for
> in the reproduced cases.
> 
> So any help with troubleshooting this further is much appreciated!

I figured out that this is expected behaviour. There are no fork bombs left,
only the usual fork & exec syscalls made each time the OCF RA calls a shell
command or the ocf_run and ocf_log functions. Those false "self-forks" are
nothing more than a transient state between the fork and exec calls, when
the command line of the child process (as shown by ps) has yet to be
updated... So I believe the problem was completely solved by the
aforementioned patch.
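
The transient state described above can be reproduced outside pacemaker.
What follows is only a minimal sketch, not taken from any of the agents
mentioned in this thread. It assumes Linux with /proc mounted, and the
helper name cmdline_of is made up for the illustration. It forks a subshell
that never calls exec and compares its command line with the parent's:

```sh
#!/bin/sh
# Sketch only: assumes Linux with /proc mounted.

# Print the command line of PID $1, turning the NUL separators into spaces.
# (cmdline_of is a hypothetical helper, not part of ocf-shellfuncs.)
cmdline_of() {
    tr '\0' ' ' < "/proc/$1/cmdline"
}

# Fork a subshell in the background. The trailing ':' keeps the shell from
# exec'ing sleep in place, so the PID in $! stays a plain fork of this shell.
( sleep 1; : ) &
child=$!

parent_cmd=$(cmdline_of "$$")
child_cmd=$(cmdline_of "$child")

echo "parent $$: $parent_cmd"
echo "child  $child: $child_cmd"

if [ "$parent_cmd" = "$child_cmd" ]; then
    echo "same command line: the child only looks like a second copy"
fi

wait "$child"
```

A child that has been forked but has not yet called exec is listed the same
way, which would explain why the ps output shows nested "dummy monitor"
entries even though the monitor action is not calling itself.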

> 
> [0] https://github.com/bogdando/dummy-ocf-ra
> [1] https://github.com/ClusterLabs/resource-agents/issues/734
> [2]
> https://github.com/rabbitmq/rabbitmq-server/blob/master/scripts/rabbitmq-server-ha.ocf
> [3]
> https://git.openstack.org/cgit/openstack/fuel-library/tree/files/fuel-ha-utils/ocf/ns_vrouter
> 
> On 04.01.2016 17:33, Bogdan Dobrelya wrote:
>> On 04.01.2016 17:14, Dejan Muhamedagic wrote:
>>> Hi,
>>>
>>> On Mon, Jan 04, 2016 at 04:52:43PM +0100, Bogdan Dobrelya wrote:
>>>> On 04.01.2016 16:36, Ken Gaillot wrote:
>>>>> On 01/04/2016 09:25 AM, Bogdan Dobrelya wrote:
>>>>>> On 04.01.2016 15:50, Bogdan Dobrelya wrote:
>>> [...]
>>>>>> Also note that lrmd spawns *many* monitors like:
>>>>>> root      6495  0.0  0.0  70268  1456 ?        Ss    2015   4:56  \_ /usr/lib/pacemaker/lrmd
>>>>>> root     31815  0.0  0.0   4440   780 ?        S    15:08   0:00  |   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>>>>>> root     31908  0.0  0.0   4440   388 ?        S    15:08   0:00  |       \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>>>>>> root     31910  0.0  0.0   4440   384 ?        S    15:08   0:00  |           \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>>>>>> root     31915  0.0  0.0   4440   392 ?        S    15:08   0:00  |               \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>>>>>> ...
>>>>>
>>>>> At first glance, that looks like your monitor action is calling itself
>>>>> recursively, but I don't see how in your code.
>>>>
>>>> Yes, it should be a bug in the ocf-shellfuncs's ocf_log().
>>>
>>> If you're sure about that, please open an issue at
>>> https://github.com/ClusterLabs/resource-agents/issues
>>
>> Submitted [0]. Thank you!
>> Note that it seems the very act of importing ocf-shellfuncs causes the
>> issue, not the ocf_run or ocf_log code itself.
>>
>> [0] https://github.com/ClusterLabs/resource-agents/issues/734
>>
>>>
>>> Thanks,
>>>
>>> Dejan
>>>
>>
>>
> 
> 

-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando