[ClusterLabs] RA hangs when called by crm_resource (resending text format)

Vladislav Bogdanov bubble at hoster-ok.com
Wed Jan 11 16:23:41 EST 2023


And, one more thing can affect that - selinux. I doubt, but that's worth 
checking.

Vladislav Bogdanov <bubble at hoster-ok.com> 11 января 2023 г. 22:21:03 написал:
> Then I would suggest to log all env vars and compare them, probably 
> something is missing in validate for virsh to be happy.
>
> Madison Kelly <mkelly at alteeve.com> 11 января 2023 г. 22:06:45 написал:
>
>> On 2023-01-11 01:13, Vladislav Bogdanov wrote:
>>> I suspect that valudate action is run as a non-root user.
>>
>> I modified the script to log the real and effective UIDs and it's
>> running as root in both instances.
>>
>>> Madison Kelly <mkelly at alteeve.com> 11 января 2023 г. 07:06:55 написал:
>>>
>>>> On 2023-01-11 00:21, Madison Kelly wrote:
>>>>> On 2023-01-11 00:14, Madison Kelly wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> Edit: Last message was in HTML format, sorry about that.
>>>>>>
>>>>>> I've got a hell of a weird problem, and I am absolutely stumped on
>>>>>> what's going on.
>>>>>>
>>>>>> The short of it is; if my RA is called from the command line, it's
>>>>>> fine. If a resource exists, monitor, enable, disable, all that stuff
>>>>>> works just fine. If I try to create a resource, it hangs on the
>>>>>> validate stage. Specifically, it hangs when 'pcs' calls:
>>>>>>
>>>>>> crm_resource --validate --output-as xml --class ocf --agent server
>>>>>> --provider alteeve --option name=<resource_name>
>>>>>>
>>>>>> Specifically, it hangs when it tries to make a shell call (to
>>>>>> virsh, specifically, but that doesn't matter). So to debug, I started
>>>>>> stripping down my RA simpler and simpler until I was left with the
>>>>>> very most basic of programs;
>>>>>>
>>>>>> https://pastebin.com/VtSpkwMr
>>>>>>
>>>>>> That is literally the simplest program I could write that made the
>>>>>> shell call. The 'open()' call is where it hangs.
>>>>>>
>>>>>> When I call directly;
>>>>>>
>>>>>> time /usr/lib/ocf/resource.d/alteeve/server --validate-all --server
>>>>>> srv04-test; echo rc:$?
>>>>>>
>>>>>> ====
>>>>>> real    0m0.061s
>>>>>> user    0m0.037s
>>>>>> sys    0m0.014s
>>>>>> rc:0
>>>>>> ====
>>>>>>
>>>>>> It's just fine. I can see in the log the output from the 'virsh' call
>>>>>> as well. However, when I call from crm_resource;
>>>>>>
>>>>>> time crm_resource --validate --output-as xml --class ocf --agent
>>>>>> server --provider alteeve --option name=srv04-test; echo rc:$?
>>>>>>
>>>>>> ====
>>>>>> <pacemaker-result api-version="2.25" request="crm_resource --validate
>>>>>> --output-as xml --class ocf --agent server --provider alteeve --option
>>>>>> name=srv04-test">
>>>>>> <resource-agent-action action="validate" class="ocf" type="server"
>>>>>> provider="alteeve">
>>>>>> <overrides/>
>>>>>> <agent-status code="1" message="error" execution_code="2"
>>>>>> execution_message="Timed Out" reason="Resource agent did not exit
>>>>>> within specified timeout"/>
>>>>>> </resource-agent-action>
>>>>>> <status code="1" message="Error occurred">
>>>>>> <errors>
>>>>>> <error>crm_resource: Error performing operation: Error
>>>>>> occurred</error>
>>>>>> </errors>
>>>>>> </status>
>>>>>> </pacemaker-result>
>>>>>>
>>>>>> real    0m20.521s
>>>>>> user    0m0.022s
>>>>>> sys    0m0.010s
>>>>>> rc:1
>>>>>> ====
>>>>>>
>>>>>> In the log file, I see (from line 20 of the super-simple-test-script):
>>>>>>
>>>>>> ====
>>>>>> Calling: [/usr/bin/virsh dumpxml --inactive srv04-test 2>&1;
>>>>>> /usr/bin/echo return_code:0 |]
>>>>>> ====
>>>>>>
>>>>>> Then nothing else.
>>>>>>
>>>>>> The strace output is: https://pastebin.com/raw/UCEUdBeP
>>>>>>
>>>>>> Environment;
>>>>>>
>>>>>> * selinux is permissive
>>>>>> * Pacemaker 2.1.5-4.el8
>>>>>> * pcs 0.10.15
>>>>>> * 4.18.0-408.el8.x86_64
>>>>>> * CentOS Stream release 8
>>>>>>
>>>>>> Any help is appreciated, I am stumped. :/
>>>>>
>>>>> After sending this, I tried having my "RA" call 'hostname', and that
>>>>> worked fine. I switched back to 'virsh list --all', and that hangs. So
>>>>> it seems to somehow be related to call 'virsh' specifically.
>>>>
>>>> OK, so more info... Knowing now that it's a problem with the virsh call
>>>> specifically (but only when validating, existing VMs monitor, enable,
>>>> disable fine, all which repeatedly call virsh), I noticed a few things.
>>>>
>>>> First, I see in the logs:
>>>>
>>>> ====
>>>> Jan 11 00:30:43 mk-a07n02.digimer.ca libvirtd[2937]: Cannot recv data:
>>>> Connection reset by peer
>>>> ====
>>>>
>>>> So with this, I further simplified my test script to this:
>>>>
>>>> https://pastebin.com/Ey8FdL1t
>>>>
>>>> Then when I ran my test script directly, the strace output is:
>>>>
>>>> Good: https://pastebin.com/Trbq67ub
>>>>
>>>> When my script is called via crm_resource, the strace is this:
>>>>
>>>> Bad: https://pastebin.com/jtbzHrUM
>>>>
>>>> The first difference I can see happens around line 929 in the good
>>>> paste, the line "futex(0x7f48b0001ca0, FUTEX_WAKE_PRIVATE, 1) = 0"
>>>> exists, which doesn't in the bad paste. Shortly after, I start seeing:
>>>>
>>>> ====
>>>> line: [write(4, "\1\0\0\0\0\0\0\0", 8)         = 8]
>>>> line: [brk(NULL)                               = 0x562b7877d000]
>>>> line: [brk(0x562b787aa000)                     = 0x562b787aa000]
>>>> line: [write(4, "\1\0\0\0\0\0\0\0", 8)         = 8]
>>>> ====
>>>>
>>>> Around line 959 in the bad paste. There are more brk() lines, and not
>>>> long after the output stops.
>>>>
>>>> --
>>>> Madison Kelly
>>>> Alteeve's Niche!
>>>> Chief Technical Officer
>>>> c: +1-647-471-0951
>>>> https://alteeve.com/
>>>>
>>>> _______________________________________________
>>>> Manage your subscription:
>>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>>
>>>> ClusterLabs home: https://www.clusterlabs.org/
>>
>> --
>> Madison Kelly
>> Alteeve's Niche!
>> Chief Technical Officer
>> c: +1-647-471-0951
>> https://alteeve.com/
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20230111/039d65fc/attachment.htm>


More information about the Users mailing list