[ClusterLabs] RA hangs when called by crm_resource (resending text format)

Madison Kelly mkelly at alteeve.com
Wed Jan 11 16:08:21 EST 2023


On 2023-01-11 15:55, Reid Wahl wrote:
> On Wed, Jan 11, 2023 at 12:48 PM Madison Kelly <mkelly at alteeve.com> wrote:
>>
>> On 2023-01-11 01:59, Reid Wahl wrote:
>>> On Tue, Jan 10, 2023 at 10:14 PM Vladislav Bogdanov
>>> <bubble at hoster-ok.com> wrote:
>>>>
>>>> I suspect that the validate action is run as a non-root user.
>>>
>>> As far as I know, both the direct command and crm_resource **should**
>>> be running the agent as the same user, as long as Madison is running
>>> both commands as the same user.
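
One way to check the user question directly is to have the agent record who it is running as and what environment it sees, once when run by hand and once via crm_resource. This is only a minimal sketch: the function name and the RA_DEBUG_LOG variable are made up for illustration and are not Pacemaker or OCF names.

```shell
#!/bin/sh
# Hypothetical debugging helper (not part of Pacemaker or the OCF API).
# RA_DEBUG_LOG is an assumed variable name; the default path is arbitrary.
log_invocation_context() {
    log_file="${RA_DEBUG_LOG:-/tmp/ra-debug.log}"
    {
        echo "---- $(date '+%F %T') pid=$$ ----"
        # Numeric IDs always resolve; the user name may not in minimal setups
        echo "uid=$(id -u) gid=$(id -g) user=$(id -un 2>/dev/null)"
        # Sorted so that two captures can be compared with diff
        env | sort
    } >> "$log_file"
}
```

Calling log_invocation_context at the top of the agent and comparing the two log entries should show whether the uid or any environment variable differs between the two invocations.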
>>>
>>> For what it's worth, I copied your test script to my machine (Fedora
>>> 36 using the current upstream main of Pacemaker) and it worked fine
>>> both directly and via crm_resource. At the moment I'm not able to dig
>>> very deeply, but I do wonder if it's either a bug that's since been
>>> fixed, or perhaps an environment issue.
>>>
>>> To try to rule out the former, do you have a test environment where
>>> you can try to reproduce it on the latest Pacemaker from upstream?
>>
>> I built the pacemaker source RPM from Fedora 37, then realized I'm
>> already running 2.1.5 on CS8, so I'm already on the latest release.
>> Looking at git, 2.1.5 is the latest tagged release... Are you running
>> newer than that?
> 
> I'm running on the current main, which contains commits that came
> after the 2.1.5 release. I don't really expect this to be a Pacemaker
> bug, especially with how recent your version is, but I would like to
> rule that out if possible.

Would you happen to have either the src.rpm or the ./configure options you
used on hand?

>>>> On January 11, 2023 at 07:06:55, Madison Kelly <mkelly at alteeve.com> wrote:
>>>>
>>>>> On 2023-01-11 00:21, Madison Kelly wrote:
>>>>>>
>>>>>> On 2023-01-11 00:14, Madison Kelly wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Edit: Last message was in HTML format, sorry about that.
>>>>>>>
>>>>>>>      I've got a hell of a weird problem, and I am absolutely stumped on
>>>>>>> what's going on.
>>>>>>>
>>>>>>>      The short of it is: if my RA is called from the command line, it's
>>>>>>> fine. If a resource exists, monitor, enable, disable, all that stuff
>>>>>>> works just fine. If I try to create a resource, it hangs on the
>>>>>>> validate stage. Specifically, it hangs when 'pcs' calls:
>>>>>>>
>>>>>>> crm_resource --validate --output-as xml --class ocf --agent server
>>>>>>> --provider alteeve --option name=<resource_name>
>>>>>>>
>>>>>>>      More precisely, it hangs when the RA tries to make a shell call (to
>>>>>>> virsh, but that doesn't matter). So to debug, I kept stripping my RA
>>>>>>> down, simpler and simpler, until I was left with the most basic of
>>>>>>> programs:
>>>>>>>
>>>>>>> https://pastebin.com/VtSpkwMr
>>>>>>>
>>>>>>>      That is literally the simplest program I could write that made the
>>>>>>> shell call. The 'open()' call is where it hangs.
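
The pasted script is not reproduced in this message, so the following is not that script. It is a rough shell sketch of the same pattern (run a command, capture its combined output, append a return_code line), with a time limit added so a hung call fails fast instead of waiting for crm_resource to give up. The function name run_with_timeout is hypothetical, and the sketch assumes the coreutils 'timeout' command is available.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command with a time limit, capture stdout and
# stderr together, and report the exit status in a "return_code:N" line.
run_with_timeout() {
    limit="$1"
    shift
    # The && / || form keeps the real exit status even under 'set -e'
    output="$(timeout "$limit" "$@" 2>&1)" && rc=0 || rc=$?
    printf '%s\n' "$output"
    echo "return_code:$rc"
    return "$rc"
}
```

For example, 'run_with_timeout 10 /usr/bin/virsh dumpxml --inactive srv04-test' would print whatever virsh produced followed by the return code; with GNU timeout, a return code of 124 means the limit was hit.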
>>>>>>>
>>>>>>> When I call directly;
>>>>>>>
>>>>>>> time /usr/lib/ocf/resource.d/alteeve/server --validate-all --server
>>>>>>> srv04-test; echo rc:$?
>>>>>>>
>>>>>>> ====
>>>>>>> real    0m0.061s
>>>>>>> user    0m0.037s
>>>>>>> sys    0m0.014s
>>>>>>> rc:0
>>>>>>> ====
>>>>>>>
>>>>>>> It's just fine. I can see in the log the output from the 'virsh' call
>>>>>>> as well. However, when I call from crm_resource;
>>>>>>>
>>>>>>> time crm_resource --validate --output-as xml --class ocf --agent
>>>>>>> server --provider alteeve --option name=srv04-test; echo rc:$?
>>>>>>>
>>>>>>> ====
>>>>>>> <pacemaker-result api-version="2.25" request="crm_resource --validate
>>>>>>> --output-as xml --class ocf --agent server --provider alteeve --option
>>>>>>> name=srv04-test">
>>>>>>>      <resource-agent-action action="validate" class="ocf" type="server"
>>>>>>> provider="alteeve">
>>>>>>>        <overrides/>
>>>>>>>        <agent-status code="1" message="error" execution_code="2"
>>>>>>> execution_message="Timed Out" reason="Resource agent did not exit
>>>>>>> within specified timeout"/>
>>>>>>>      </resource-agent-action>
>>>>>>>      <status code="1" message="Error occurred">
>>>>>>>        <errors>
>>>>>>>          <error>crm_resource: Error performing operation: Error
>>>>>>> occurred</error>
>>>>>>>        </errors>
>>>>>>>      </status>
>>>>>>> </pacemaker-result>
>>>>>>>
>>>>>>> real    0m20.521s
>>>>>>> user    0m0.022s
>>>>>>> sys    0m0.010s
>>>>>>> rc:1
>>>>>>> ====
>>>>>>>
>>>>>>> In the log file, I see (from line 20 of the super-simple-test-script):
>>>>>>>
>>>>>>> ====
>>>>>>> Calling: [/usr/bin/virsh dumpxml --inactive srv04-test 2>&1;
>>>>>>> /usr/bin/echo return_code:0 |]
>>>>>>> ====
>>>>>>>
>>>>>>> Then nothing else.
>>>>>>>
>>>>>>> The strace output is: https://pastebin.com/raw/UCEUdBeP
>>>>>>>
>>>>>>> Environment;
>>>>>>>
>>>>>>> * selinux is permissive
>>>>>>> * Pacemaker 2.1.5-4.el8
>>>>>>> * pcs 0.10.15
>>>>>>> * 4.18.0-408.el8.x86_64
>>>>>>> * CentOS Stream release 8
>>>>>>>
>>>>>>> Any help is appreciated, I am stumped. :/
>>>>>>
>>>>>>
>>>>>> After sending this, I tried having my "RA" call 'hostname', and that
>>>>>> worked fine. I switched back to 'virsh list --all', and that hangs. So
>>>>>> it seems to somehow be related to calling 'virsh' specifically.
>>>>>>
>>>>>
>>>>> OK, so more info... Knowing now that it's a problem with the virsh call
>>>>> specifically (but only when validating; existing VMs monitor, enable,
>>>>> and disable fine, all of which repeatedly call virsh), I noticed a few
>>>>> things.
>>>>>
>>>>> First, I see in the logs:
>>>>>
>>>>> ====
>>>>> Jan 11 00:30:43 mk-a07n02.digimer.ca libvirtd[2937]: Cannot recv data:
>>>>> Connection reset by peer
>>>>> ====
>>>>>
>>>>> So with this, I further simplified my test script to this:
>>>>>
>>>>> https://pastebin.com/Ey8FdL1t
>>>>>
>>>>> Then when I ran my test script directly, the strace output is:
>>>>>
>>>>> Good: https://pastebin.com/Trbq67ub
>>>>>
>>>>> When my script is called via crm_resource, the strace is this:
>>>>>
>>>>> Bad: https://pastebin.com/jtbzHrUM
>>>>>
>>>>> The first difference I can see is around line 929 in the good paste,
>>>>> where the line "futex(0x7f48b0001ca0, FUTEX_WAKE_PRIVATE, 1) = 0"
>>>>> exists; it doesn't appear in the bad paste. Shortly after, I start seeing:
>>>>>
>>>>> ====
>>>>> line: [write(4, "\1\0\0\0\0\0\0\0", 8)         = 8]
>>>>> line: [brk(NULL)                               = 0x562b7877d000]
>>>>> line: [brk(0x562b787aa000)                     = 0x562b787aa000]
>>>>> line: [write(4, "\1\0\0\0\0\0\0\0", 8)         = 8]
>>>>> ====
>>>>>
>>>>> That is around line 959 in the bad paste. There are more brk() lines,
>>>>> and not long after, the output stops.
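
Comparing two strace captures by eye is error-prone because addresses and PIDs change on every run. A possible approach, sketched below, is to normalize those values first and then diff the results so the first real divergence stands out. The function names and the file names in the usage note are hypothetical; the sed expressions also replace any other hex literal, which is acceptable for a rough comparison.

```shell
#!/bin/sh
# Hypothetical helpers for comparing two saved strace outputs.
normalize_trace() {
    # Mask hex addresses and strip a leading PID column (as added by strace -f)
    sed -E -e 's/0x[0-9a-fA-F]+/0xADDR/g' -e 's/^[0-9]+ +//' "$1"
}

first_divergence() {
    good_norm="$(mktemp)"
    bad_norm="$(mktemp)"
    normalize_trace "$1" > "$good_norm"
    normalize_trace "$2" > "$bad_norm"
    # Show only the start of the diff; the first hunk is the first divergence
    diff "$good_norm" "$bad_norm" | head -n 10
    rm -f "$good_norm" "$bad_norm"
}
```

With the two pastes saved locally as, say, good.strace and bad.strace, 'first_divergence good.strace bad.strace' would print the first lines that exist in one trace but not the other.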
>>>>>
>>>>> --
>>>>> Madison Kelly
>>>>> Alteeve's Niche!
>>>>> Chief Technical Officer
>>>>> c: +1-647-471-0951
>>>>> https://alteeve.com/
>>>>>
>>>>> _______________________________________________
>>>>> Manage your subscription:
>>>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>>>
>>>>> ClusterLabs home: https://www.clusterlabs.org/
>>>>
>>>>
>>>
>>>
>>>
>>
> 
> 

-- 
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/
