[ClusterLabs] RA hangs when called by crm_resource (resending text format)

Madison Kelly mkelly at alteeve.com
Wed Jan 11 14:01:24 EST 2023


On 2023-01-11 01:59, Reid Wahl wrote:
> On Tue, Jan 10, 2023 at 10:14 PM Vladislav Bogdanov
> <bubble at hoster-ok.com> wrote:
>>
>> I suspect that the validate action is run as a non-root user.
> 
> As far as I know, both the direct command and crm_resource **should**
> be running the agent as the same user, as long as Madison is running
> both commands as the same user.
> 
> For what it's worth, I copied your test script to my machine (Fedora
> 36 using the current upstream main of Pacemaker) and it worked fine
> both directly and via crm_resource. At the moment I'm not able to dig
> very deeply, but I do wonder if it's either a bug that's since been
> fixed, or perhaps an environment issue.
> 
> To try to rule out the former, do you have a test environment where
> you can try to reproduce it on the latest Pacemaker from upstream?

I am running both as the same (root, direct ssh, not sudo'd) user. I run 
them back-to-back with consistent results.
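
For the environment theory, here is a minimal sketch (Python, not part of 
the RA; the function names are made up for illustration) of how the two 
invocations could be compared: record who the process runs as and what 
its environment looks like in each case, then diff the two records.

```python
import os


def snapshot():
    """Collect the identity and environment of the current process."""
    return {
        "uid": os.getuid(),
        "euid": os.geteuid(),
        "gid": os.getgid(),
        "cwd": os.getcwd(),
        "stdin_is_tty": os.isatty(0),
        "env": dict(os.environ),
    }


def diff_snapshots(a, b):
    """Return {key: (a_value, b_value)} for every entry that differs."""
    diffs = {}
    # Compare the scalar fields first (everything except the env dict).
    for key in sorted((set(a) | set(b)) - {"env"}):
        if a.get(key) != b.get(key):
            diffs[key] = (a.get(key), b.get(key))
    # Then compare environment variables one by one; a variable missing
    # on one side shows up as None.
    env_a, env_b = a.get("env", {}), b.get("env", {})
    for name in sorted(set(env_a) | set(env_b)):
        if env_a.get(name) != env_b.get(name):
            diffs["env:" + name] = (env_a.get(name), env_b.get(name))
    return diffs
```

Calling snapshot() from the agent under each invocation and saving the 
result (for example with json.dump) would give two records to pass to 
diff_snapshots().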

I've not built Pacemaker in ages. Is there a src.rpm that's likely to 
build against CentOS Stream 8 that I could try? If not, do you know the 
command offhand to create the RPMs from source? Failing that, I'll grab 
the source and read the docs for configure.

>> On January 11, 2023 at 07:06:55, Madison Kelly <mkelly at alteeve.com> wrote:
>>
>>> On 2023-01-11 00:21, Madison Kelly wrote:
>>>>
>>>> On 2023-01-11 00:14, Madison Kelly wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Edit: Last message was in HTML format, sorry about that.
>>>>>
>>>>>     I've got a hell of a weird problem, and I am absolutely stumped on
>>>>> what's going on.
>>>>>
>>>>>     The short of it is: if my RA is called from the command line, it's
>>>>> fine. If a resource exists, monitor, enable, disable, all that stuff
>>>>> works just fine. If I try to create a resource, it hangs on the
>>>>> validate stage. Specifically, it hangs when 'pcs' calls:
>>>>>
>>>>> crm_resource --validate --output-as xml --class ocf --agent server
>>>>> --provider alteeve --option name=<resource_name>
>>>>>
>>>>>     More precisely, it hangs when it tries to make a shell call (to
>>>>> virsh, though that doesn't matter). So to debug, I kept stripping
>>>>> my RA down further and further until I was left with the most
>>>>> basic of programs:
>>>>>
>>>>> https://pastebin.com/VtSpkwMr
>>>>>
>>>>>     That is literally the simplest program I could write that made the
>>>>> shell call. The 'open()' call is where it hangs.
>>>>>
>>>>> When I call it directly:
>>>>>
>>>>> time /usr/lib/ocf/resource.d/alteeve/server --validate-all --server
>>>>> srv04-test; echo rc:$?
>>>>>
>>>>> ====
>>>>> real    0m0.061s
>>>>> user    0m0.037s
>>>>> sys    0m0.014s
>>>>> rc:0
>>>>> ====
>>>>>
>>>>> It's just fine. I can see in the log the output from the 'virsh' call
>>>>> as well. However, when I call it from crm_resource:
>>>>>
>>>>> time crm_resource --validate --output-as xml --class ocf --agent
>>>>> server --provider alteeve --option name=srv04-test; echo rc:$?
>>>>>
>>>>> ====
>>>>> <pacemaker-result api-version="2.25" request="crm_resource --validate
>>>>> --output-as xml --class ocf --agent server --provider alteeve --option
>>>>> name=srv04-test">
>>>>>     <resource-agent-action action="validate" class="ocf" type="server"
>>>>> provider="alteeve">
>>>>>       <overrides/>
>>>>>       <agent-status code="1" message="error" execution_code="2"
>>>>> execution_message="Timed Out" reason="Resource agent did not exit
>>>>> within specified timeout"/>
>>>>>     </resource-agent-action>
>>>>>     <status code="1" message="Error occurred">
>>>>>       <errors>
>>>>>         <error>crm_resource: Error performing operation: Error
>>>>> occurred</error>
>>>>>       </errors>
>>>>>     </status>
>>>>> </pacemaker-result>
>>>>>
>>>>> real    0m20.521s
>>>>> user    0m0.022s
>>>>> sys    0m0.010s
>>>>> rc:1
>>>>> ====
>>>>>
>>>>> In the log file, I see (from line 20 of the super-simple-test-script):
>>>>>
>>>>> ====
>>>>> Calling: [/usr/bin/virsh dumpxml --inactive srv04-test 2>&1;
>>>>> /usr/bin/echo return_code:0 |]
>>>>> ====
>>>>>
>>>>> Then nothing else.
>>>>>
>>>>> The strace output is: https://pastebin.com/raw/UCEUdBeP
>>>>>
>>>>> Environment:
>>>>>
>>>>> * selinux is permissive
>>>>> * Pacemaker 2.1.5-4.el8
>>>>> * pcs 0.10.15
>>>>> * Kernel 4.18.0-408.el8.x86_64
>>>>> * CentOS Stream release 8
>>>>>
>>>>> Any help is appreciated, I am stumped. :/
>>>>
>>>>
>>>> After sending this, I tried having my "RA" call 'hostname', and that
>>>> worked fine. I switched back to 'virsh list --all', and that hangs. So
>>>> it seems to somehow be related to calling 'virsh' specifically.
>>>>
>>>
>>> OK, so more info... Knowing now that it's a problem with the virsh call
>>> specifically (but only when validating; existing VMs monitor, enable,
>>> and disable fine, all of which repeatedly call virsh), I noticed a few
>>> things.
>>>
>>> First, I see in the logs:
>>>
>>> ====
>>> Jan 11 00:30:43 mk-a07n02.digimer.ca libvirtd[2937]: Cannot recv data:
>>> Connection reset by peer
>>> ====
>>>
>>> So with this, I further simplified my test script to this:
>>>
>>> https://pastebin.com/Ey8FdL1t
>>>
>>> Then when I ran my test script directly, the strace output is:
>>>
>>> Good: https://pastebin.com/Trbq67ub
>>>
>>> When my script is called via crm_resource, the strace is this:
>>>
>>> Bad: https://pastebin.com/jtbzHrUM
>>>
>>> The first difference I can see is around line 929 in the good paste,
>>> where the line "futex(0x7f48b0001ca0, FUTEX_WAKE_PRIVATE, 1) = 0"
>>> appears; it is absent from the bad paste. Shortly after, I start seeing:
>>>
>>> ====
>>> line: [write(4, "\1\0\0\0\0\0\0\0", 8)         = 8]
>>> line: [brk(NULL)                               = 0x562b7877d000]
>>> line: [brk(0x562b787aa000)                     = 0x562b787aa000]
>>> line: [write(4, "\1\0\0\0\0\0\0\0", 8)         = 8]
>>> ====
>>>
>>> That is around line 959 in the bad paste. There are more brk() lines,
>>> and not long after, the output stops.
>>>
>>> --
>>> Madison Kelly
>>> Alteeve's Niche!
>>> Chief Technical Officer
>>> c: +1-647-471-0951
>>> https://alteeve.com/
>>>
>>> _______________________________________________
>>> Manage your subscription:
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> ClusterLabs home: https://www.clusterlabs.org/

-- 
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/


