[ClusterLabs] RA hangs when called by crm_resource (resending text format)
Madison Kelly
mkelly at alteeve.com
Wed Jan 11 01:06:37 EST 2023
On 2023-01-11 00:21, Madison Kelly wrote:
> On 2023-01-11 00:14, Madison Kelly wrote:
>> Hi all,
>>
>> Edit: Last message was in HTML format, sorry about that.
>>
>> I've got a hell of a weird problem, and I am absolutely stumped on
>> what's going on.
>>
>> The short of it is: if my RA is called from the command line, it's
>> fine. If a resource already exists, monitor, enable, disable, and all
>> that stuff work just fine. If I try to create a resource, it hangs at
>> the validate stage. Specifically, it hangs when 'pcs' calls:
>>
>> crm_resource --validate --output-as xml --class ocf --agent server
>> --provider alteeve --option name=<resource_name>
>>
>> It hangs when the RA tries to make a shell call (to virsh, but that
>> doesn't matter). So to debug, I kept stripping my RA down until I was
>> left with the most basic of programs:
>>
>> https://pastebin.com/VtSpkwMr
>>
>> That is literally the simplest program I could write that made the
>> shell call. The 'open()' call is where it hangs.
>>
>> When I call it directly:
>>
>> time /usr/lib/ocf/resource.d/alteeve/server --validate-all --server
>> srv04-test; echo rc:$?
>>
>> ====
>> real 0m0.061s
>> user 0m0.037s
>> sys 0m0.014s
>> rc:0
>> ====
>>
>> That works just fine, and I can see the output from the 'virsh' call
>> in the log as well. However, when I call it via crm_resource:
>>
>> time crm_resource --validate --output-as xml --class ocf --agent
>> server --provider alteeve --option name=srv04-test; echo rc:$?
>>
>> ====
>> <pacemaker-result api-version="2.25" request="crm_resource --validate
>> --output-as xml --class ocf --agent server --provider alteeve --option
>> name=srv04-test">
>> <resource-agent-action action="validate" class="ocf" type="server"
>> provider="alteeve">
>> <overrides/>
>> <agent-status code="1" message="error" execution_code="2"
>> execution_message="Timed Out" reason="Resource agent did not exit
>> within specified timeout"/>
>> </resource-agent-action>
>> <status code="1" message="Error occurred">
>> <errors>
>> <error>crm_resource: Error performing operation: Error
>> occurred</error>
>> </errors>
>> </status>
>> </pacemaker-result>
>>
>> real 0m20.521s
>> user 0m0.022s
>> sys 0m0.010s
>> rc:1
>> ====
>>
>> In the log file, I see (from line 20 of the super-simple-test-script):
>>
>> ====
>> Calling: [/usr/bin/virsh dumpxml --inactive srv04-test 2>&1;
>> /usr/bin/echo return_code:0 |]
>> ====
>>
>> Then nothing else.
>>
>> The strace output is: https://pastebin.com/raw/UCEUdBeP
>>
>> Environment:
>>
>> * SELinux is permissive
>> * Pacemaker 2.1.5-4.el8
>> * pcs 0.10.15
>> * Kernel 4.18.0-408.el8.x86_64
>> * CentOS Stream release 8
>>
>> Any help is appreciated, I am stumped. :/
>
> After sending this, I tried having my "RA" call 'hostname', and that
> worked fine. I switched back to 'virsh list --all', and that hangs. So
> it seems to be related to calling 'virsh' specifically.
>
OK, so more info... Knowing now that it's a problem with the virsh call
specifically (but only when validating; existing VMs monitor, enable,
and disable fine, all of which repeatedly call virsh), I noticed a few
things.

First, I see in the logs:

====
Jan 11 00:30:43 mk-a07n02.digimer.ca libvirtd[2937]: Cannot recv data:
Connection reset by peer
====

So with this, I further simplified my test script to this:

https://pastebin.com/Ey8FdL1t

When I run my test script directly, the strace output is:

Good: https://pastebin.com/Trbq67ub

When my script is called via crm_resource, the strace output is:

Bad: https://pastebin.com/jtbzHrUM

The first difference I can see is around line 929 of the good paste: the
line "futex(0x7f48b0001ca0, FUTEX_WAKE_PRIVATE, 1) = 0" is there, but it
is missing from the bad paste. Shortly after, around line 959 of the bad
paste, I start seeing:

====
line: [write(4, "\1\0\0\0\0\0\0\0", 8) = 8]
line: [brk(NULL) = 0x562b7877d000]
line: [brk(0x562b787aa000) = 0x562b787aa000]
line: [write(4, "\1\0\0\0\0\0\0\0", 8) = 8]
====

There are more brk() lines, and not long after that the output stops.
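
For anyone wanting to repeat the good/bad comparison, here is a minimal
Python sketch of one way to locate the first diverging line between two
saved strace outputs. It is only an illustration, not the method used
above: the function names are made up for this example, and it assumes
plain strace output with at most a leading PID on each line (no
timestamps). Hex addresses are masked because they differ between any
two runs. The comparison is strictly line by line, so it only finds the
first mismatch, and different thread scheduling between runs can produce
an earlier, harmless mismatch.

```python
import re
from itertools import zip_longest

# Values that differ between any two runs and would otherwise hide the
# first meaningful difference.
_HEX = re.compile(r"0x[0-9a-fA-F]+")
# Leading "[pid 1234] " (strace -f) or "1234 " (strace -f -o file) prefix.
_PID = re.compile(r"^\[pid\s+\d+\]\s*|^\d+\s+")


def normalise(line):
    """Strip the PID prefix and mask hex addresses in one strace line."""
    line = _PID.sub("", line.rstrip("\n"))
    return _HEX.sub("0xADDR", line)


def first_divergence(good_lines, bad_lines):
    """Return (line_number, good, bad) for the first mismatch, else None.

    If one trace is shorter, the missing side is reported as None.
    """
    pairs = zip_longest(good_lines, bad_lines)
    for number, (good, bad) in enumerate(pairs, start=1):
        g = normalise(good) if good is not None else None
        b = normalise(bad) if bad is not None else None
        if g != b:
            return number, g, b
    return None
```

To use it, read each saved trace with open(...).readlines() (for
example from hypothetical files good.strace and bad.strace) and pass
the two lists to first_divergence().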
--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/