[ClusterLabs] Antw: [EXT] Re: RA hangs when called by crm_resource (resending text format)
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Thu Jan 12 02:29:17 EST 2023
>>> Madison Kelly <mkelly at alteeve.com> wrote on 11.01.2023 at 22:06 in
message
<8a2f2d45-0419-8e97-1805-2998a9b838af at alteeve.com>:
> On 2023-01-11 01:13, Vladislav Bogdanov wrote:
>> I suspect that the validate action is run as a non-root user.
>
> I modified the script to log the real and effective UIDs and it's
> running as root in both instances.
I'm not running Red Hat, but could one of the additional security
features (SELinux, for example) be involved?
If possible, try disabling those for the test, or test your RA on a
non-Red Hat system (just for testing).
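
To see whether the two invocations really run in the same context, it may help to dump more than the UIDs and compare the results. Below is a minimal Python sketch, not taken from your RA: the function names snapshot_context and diff_context are my own, and reading /proc/self/attr/current for the security label assumes a Linux system with an LSM such as SELinux active (the label is simply None otherwise).

```python
import os


def snapshot_context():
    """Collect identity, security label, and environment of this process."""
    try:
        # Assumption: Linux with an LSM (e.g. SELinux); otherwise label is None.
        with open("/proc/self/attr/current", "rb") as handle:
            label = handle.read().rstrip(b"\x00\n").decode(errors="replace")
    except OSError:
        label = None
    return {
        "uid": os.getuid(),
        "euid": os.geteuid(),
        "groups": sorted(os.getgroups()),
        "security_label": label,
        "cwd": os.getcwd(),
        "env": dict(os.environ),
    }


def diff_context(direct, via_crm):
    """Return {key: (direct_value, via_crm_value)} for every differing entry."""
    changes = {}
    for key in sorted(set(direct) | set(via_crm)):
        if key == "env":
            continue  # environment variables are compared one by one below
        if direct.get(key) != via_crm.get(key):
            changes[key] = (direct.get(key), via_crm.get(key))
    env_a, env_b = direct.get("env", {}), via_crm.get("env", {})
    for name in sorted(set(env_a) | set(env_b)):
        if env_a.get(name) != env_b.get(name):
            changes["env." + name] = (env_a.get(name), env_b.get(name))
    return changes
```

The idea would be to write one snapshot (as JSON, for example) when the script is run directly and another when it is run through crm_resource, then feed both to diff_context. An empty result would at least rule out the environment, the working directory and the security label.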
Regards,
Ulrich
>
>> On 11 January 2023 at 07:06:55, Madison Kelly <mkelly at alteeve.com> wrote:
>>
>>> On 2023-01-11 00:21, Madison Kelly wrote:
>>>> On 2023-01-11 00:14, Madison Kelly wrote:
>>>>> Hi all,
>>>>>
>>>>> Edit: Last message was in HTML format, sorry about that.
>>>>>
>>>>> I've got a hell of a weird problem, and I am absolutely stumped on
>>>>> what's going on.
>>>>>
>>>>> The short of it is: if my RA is called from the command line, it's
>>>>> fine. If a resource already exists, monitor, enable, disable, all
>>>>> that stuff works just fine. If I try to create a resource, it hangs
>>>>> at the validate stage. Specifically, it hangs when 'pcs' calls:
>>>>>
>>>>> crm_resource --validate --output-as xml --class ocf --agent server
>>>>> --provider alteeve --option name=<resource_name>
>>>>>
>>>>> More precisely, it hangs when it tries to make a shell call (to
>>>>> virsh in this case, but that doesn't matter). So to debug, I kept
>>>>> stripping my RA down further and further until I was left with the
>>>>> most basic of programs:
>>>>>
>>>>> https://pastebin.com/VtSpkwMr
>>>>>
>>>>> That is literally the simplest program I could write that made the
>>>>> shell call. The 'open()' call is where it hangs.
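
One way to keep a stuck call from blocking the whole validate action is to run it with a hard timeout and log whatever output arrived before the deadline. A minimal Python sketch follows. Your RA is not Python, so this only illustrates the idea; the helper name run_with_timeout and the 10-second default are my own assumptions, not anything from the pastebin script.

```python
import subprocess


def run_with_timeout(argv, timeout_seconds=10.0):
    """Run argv; return (return_code, output), or (None, partial_output) on timeout."""
    try:
        done = subprocess.run(
            argv,
            stdin=subprocess.DEVNULL,   # the child can never block waiting for input
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,   # same effect as the 2>&1 in the logged call
            timeout=timeout_seconds,    # the child is killed when this expires
            text=True,
        )
    except subprocess.TimeoutExpired as error:
        # Output collected before the timeout may be bytes or None.
        partial = error.output or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return None, partial
    return done.returncode, done.stdout
```

With the command from the log this would be run_with_timeout(["/usr/bin/virsh", "dumpxml", "--inactive", "srv04-test"], 10.0). A return code of None then means the call hung, and the second value shows how far it got.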
>>>>>
>>>>> When I call directly;
>>>>>
>>>>> time /usr/lib/ocf/resource.d/alteeve/server --validate-all --server
>>>>> srv04-test; echo rc:$?
>>>>>
>>>>> ====
>>>>> real 0m0.061s
>>>>> user 0m0.037s
>>>>> sys 0m0.014s
>>>>> rc:0
>>>>> ====
>>>>>
>>>>> It's just fine. I can see in the log the output from the 'virsh' call
>>>>> as well. However, when I call from crm_resource;
>>>>>
>>>>> time crm_resource --validate --output-as xml --class ocf --agent
>>>>> server --provider alteeve --option name=srv04-test; echo rc:$?
>>>>>
>>>>> ====
>>>>> <pacemaker-result api-version="2.25" request="crm_resource --validate
>>>>> --output-as xml --class ocf --agent server --provider alteeve --option
>>>>> name=srv04-test">
>>>>> <resource-agent-action action="validate" class="ocf" type="server"
>>>>> provider="alteeve">
>>>>> <overrides/>
>>>>> <agent-status code="1" message="error" execution_code="2"
>>>>> execution_message="Timed Out" reason="Resource agent did not exit
>>>>> within specified timeout"/>
>>>>> </resource-agent-action>
>>>>> <status code="1" message="Error occurred">
>>>>> <errors>
>>>>> <error>crm_resource: Error performing operation: Error
>>>>> occurred</error>
>>>>> </errors>
>>>>> </status>
>>>>> </pacemaker-result>
>>>>>
>>>>> real 0m20.521s
>>>>> user 0m0.022s
>>>>> sys 0m0.010s
>>>>> rc:1
>>>>> ====
>>>>>
>>>>> In the log file, I see (from line 20 of the super-simple-test-script):
>>>>>
>>>>> ====
>>>>> Calling: [/usr/bin/virsh dumpxml --inactive srv04-test 2>&1;
>>>>> /usr/bin/echo return_code:0 |]
>>>>> ====
>>>>>
>>>>> Then nothing else.
>>>>>
>>>>> The strace output is: https://pastebin.com/raw/UCEUdBeP
>>>>>
>>>>> Environment;
>>>>>
>>>>> * selinux is permissive
>>>>> * Pacemaker 2.1.5-4.el8
>>>>> * pcs 0.10.15
>>>>> * 4.18.0-408.el8.x86_64
>>>>> * CentOS Stream release 8
>>>>>
>>>>> Any help is appreciated, I am stumped. :/
>>>>
>>>> After sending this, I tried having my "RA" call 'hostname', and that
>>>> worked fine. I switched back to 'virsh list --all', and that hangs. So
>>>> it seems to be related to calling 'virsh' specifically.
>>>>
>>>
>>> OK, so more info... Knowing now that it's a problem with the virsh call
>>> specifically (but only when validating; existing VMs monitor, enable,
>>> and disable fine, all of which repeatedly call virsh), I noticed a few
>>> things.
>>>
>>> First, I see in the logs:
>>>
>>> ====
>>> Jan 11 00:30:43 mk-a07n02.digimer.ca libvirtd[2937]: Cannot recv data:
>>> Connection reset by peer
>>> ====
>>>
>>> So with this, I further simplified my test script to this:
>>>
>>> https://pastebin.com/Ey8FdL1t
>>>
>>> Then when I ran my test script directly, the strace output is:
>>>
>>> Good: https://pastebin.com/Trbq67ub
>>>
>>> When my script is called via crm_resource, the strace is this:
>>>
>>> Bad: https://pastebin.com/jtbzHrUM
>>>
>>> The first difference I can see is around line 929 in the good
>>> paste: the line "futex(0x7f48b0001ca0, FUTEX_WAKE_PRIVATE, 1) = 0"
>>> appears there but not in the bad paste. Shortly after, I start seeing:
>>>
>>> ====
>>> line: [write(4, "\1\0\0\0\0\0\0\0", 8) = 8]
>>> line: [brk(NULL) = 0x562b7877d000]
>>> line: [brk(0x562b787aa000) = 0x562b787aa000]
>>> line: [write(4, "\1\0\0\0\0\0\0\0", 8) = 8]
>>> ====
>>>
>>> That is around line 959 in the bad paste. There are more brk() lines,
>>> and not long after that the output stops.
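
Comparing two strace logs by eye is error-prone because addresses and PIDs differ on every run. A small Python sketch that normalizes those and reports the first line where the two traces really diverge might help. The function names are my own, and the normalization is deliberately crude: it only handles hex addresses and a leading PID column, so return values that legitimately vary between runs can still show up as differences.

```python
import re

_HEX = re.compile(r"0x[0-9a-fA-F]+")
_LEADING_PID = re.compile(r"^(\[pid\s+\d+\]|\d+)\s+")


def normalize(line):
    """Strip a leading PID column and mask hex addresses in one strace line."""
    line = _LEADING_PID.sub("", line.strip())
    return _HEX.sub("0xADDR", line)


def first_divergence(good_lines, bad_lines):
    """Return (1-based line number, good_line, bad_line) at the first mismatch, or None."""
    for index, (good, bad) in enumerate(zip(good_lines, bad_lines), start=1):
        if normalize(good) != normalize(bad):
            return index, good.rstrip("\n"), bad.rstrip("\n")
    if len(good_lines) != len(bad_lines):
        # One trace is a prefix of the other: report where the shorter one ends.
        index = min(len(good_lines), len(bad_lines))
        good = good_lines[index].rstrip("\n") if index < len(good_lines) else None
        bad = bad_lines[index].rstrip("\n") if index < len(bad_lines) else None
        return index + 1, good, bad
    return None
```

Usage would be to read both saved traces with readlines() and pass the two lists in, good trace first. The file names are up to you.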
>>>
>>> --
>>> Madison Kelly
>>> Alteeve's Niche!
>>> Chief Technical Officer
>>> c: +1-647-471-0951
>>> https://alteeve.com/
>>>
>>> _______________________________________________
>>> Manage your subscription:
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> ClusterLabs home: https://www.clusterlabs.org/
>>
>