[ClusterLabs] RA hangs when called by crm_resource (resending text format)
Madison Kelly
mkelly at alteeve.com
Wed Jan 11 15:48:43 EST 2023
On 2023-01-11 01:59, Reid Wahl wrote:
> On Tue, Jan 10, 2023 at 10:14 PM Vladislav Bogdanov
> <bubble at hoster-ok.com> wrote:
>>
>> I suspect that valudate action is run as a non-root user.
>
> As far as I know, both the direct command and crm_resource **should**
> be running the agent as the same user, as long as Madison is running
> both commands as the same user.
>
> For what it's worth, I copied your test script to my machine (Fedora
> 36 using the current upstream main of Pacemaker) and it worked fine
> both directly and via crm_resource. At the moment I'm not able to dig
> very deeply, but I do wonder if it's either a bug that's since been
> fixed, or perhaps an environment issue.
>
> To try to rule out the former, do you have a test environment where
> you can try to reproduce it on the latest Pacemaker from upstream?
I built the pacemaker source RPM from Fedora 37, then realized I'm
already running 2.1.5 on CS8, so I'm already on the latest release.
Looking at git, 2.1.5 is the latest tagged release... Are you running
newer than that?
>> Madison Kelly <mkelly at alteeve.com> 11 января 2023 г. 07:06:55 написал:
>>
>>> On 2023-01-11 00:21, Madison Kelly wrote:
>>>>
>>>> On 2023-01-11 00:14, Madison Kelly wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Edit: Last message was in HTML format, sorry about that.
>>>>>
>>>>> I've got a hell of a weird problem, and I am absolutely stumped on
>>>>> what's going on.
>>>>>
>>>>> The short of it is; if my RA is called from the command line, it's
>>>>> fine. If a resource exists, monitor, enable, disable, all that stuff
>>>>> works just fine. If I try to create a resource, it hangs on the
>>>>> validate stage. Specifically, it hangs when 'pcs' calls:
>>>>>
>>>>> crm_resource --validate --output-as xml --class ocf --agent server
>>>>> --provider alteeve --option name=<resource_name>
>>>>>
>>>>> Specifically, it hangs when it tries to make a shell call (to
>>>>> virsh, specifically, but that doesn't matter). So to debug, I started
>>>>> stripping down my RA simpler and simpler until I was left with the
>>>>> very most basic of programs;
>>>>>
>>>>> https://pastebin.com/VtSpkwMr
>>>>>
>>>>> That is literally the simplest program I could write that made the
>>>>> shell call. The 'open()' call is where it hangs.
>>>>>
>>>>> When I call directly;
>>>>>
>>>>> time /usr/lib/ocf/resource.d/alteeve/server --validate-all --server
>>>>> srv04-test; echo rc:$?
>>>>>
>>>>> ====
>>>>> real 0m0.061s
>>>>> user 0m0.037s
>>>>> sys 0m0.014s
>>>>> rc:0
>>>>> ====
>>>>>
>>>>> It's just fine. I can see in the log the output from the 'virsh' call
>>>>> as well. However, when I call from crm_resource;
>>>>>
>>>>> time crm_resource --validate --output-as xml --class ocf --agent
>>>>> server --provider alteeve --option name=srv04-test; echo rc:$?
>>>>>
>>>>> ====
>>>>> <pacemaker-result api-version="2.25" request="crm_resource --validate
>>>>> --output-as xml --class ocf --agent server --provider alteeve --option
>>>>> name=srv04-test">
>>>>> <resource-agent-action action="validate" class="ocf" type="server"
>>>>> provider="alteeve">
>>>>> <overrides/>
>>>>> <agent-status code="1" message="error" execution_code="2"
>>>>> execution_message="Timed Out" reason="Resource agent did not exit
>>>>> within specified timeout"/>
>>>>> </resource-agent-action>
>>>>> <status code="1" message="Error occurred">
>>>>> <errors>
>>>>> <error>crm_resource: Error performing operation: Error
>>>>> occurred</error>
>>>>> </errors>
>>>>> </status>
>>>>> </pacemaker-result>
>>>>>
>>>>> real 0m20.521s
>>>>> user 0m0.022s
>>>>> sys 0m0.010s
>>>>> rc:1
>>>>> ====
>>>>>
>>>>> In the log file, I see (from line 20 of the super-simple-test-script):
>>>>>
>>>>> ====
>>>>> Calling: [/usr/bin/virsh dumpxml --inactive srv04-test 2>&1;
>>>>> /usr/bin/echo return_code:0 |]
>>>>> ====
>>>>>
>>>>> Then nothing else.
>>>>>
>>>>> The strace output is: https://pastebin.com/raw/UCEUdBeP
>>>>>
>>>>> Environment;
>>>>>
>>>>> * selinux is permissive
>>>>> * Pacemaker 2.1.5-4.el8
>>>>> * pcs 0.10.15
>>>>> * 4.18.0-408.el8.x86_64
>>>>> * CentOS Stream release 8
>>>>>
>>>>> Any help is appreciated, I am stumped. :/
>>>>
>>>>
>>>> After sending this, I tried having my "RA" call 'hostname', and that
>>>> worked fine. I switched back to 'virsh list --all', and that hangs. So
>>>> it seems to somehow be related to call 'virsh' specifically.
>>>>
>>>
>>> OK, so more info... Knowing now that it's a problem with the virsh call
>>> specifically (but only when validating, existing VMs monitor, enable,
>>> disable fine, all which repeatedly call virsh), I noticed a few things.
>>>
>>> First, I see in the logs:
>>>
>>> ====
>>> Jan 11 00:30:43 mk-a07n02.digimer.ca libvirtd[2937]: Cannot recv data:
>>> Connection reset by peer
>>> ====
>>>
>>> So with this, I further simplified my test script to this:
>>>
>>> https://pastebin.com/Ey8FdL1t
>>>
>>> Then when I ran my test script directly, the strace output is:
>>>
>>> Good: https://pastebin.com/Trbq67ub
>>>
>>> When my script is called via crm_resource, the strace is this:
>>>
>>> Bad: https://pastebin.com/jtbzHrUM
>>>
>>> The first difference I can see happens around line 929 in the good
>>> paste, the line "futex(0x7f48b0001ca0, FUTEX_WAKE_PRIVATE, 1) = 0"
>>> exists, which doesn't in the bad paste. Shortly after, I start seeing:
>>>
>>> ====
>>> line: [write(4, "\1\0\0\0\0\0\0\0", 8) = 8]
>>> line: [brk(NULL) = 0x562b7877d000]
>>> line: [brk(0x562b787aa000) = 0x562b787aa000]
>>> line: [write(4, "\1\0\0\0\0\0\0\0", 8) = 8]
>>> ====
>>>
>>> Around line 959 in the bad paste. There are more brk() lines, and not
>>> long after the output stops.
>>>
>>> --
>>> Madison Kelly
>>> Alteeve's Niche!
>>> Chief Technical Officer
>>> c: +1-647-471-0951
>>> https://alteeve.com/
>>>
>>> _______________________________________________
>>> Manage your subscription:
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> ClusterLabs home: https://www.clusterlabs.org/
>>
>>
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/
>
>
>
--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
More information about the Users
mailing list