[ClusterLabs] RA hangs when called by crm_resource (resending text format)

Reid Wahl nwahl at redhat.com
Wed Jan 11 16:12:28 EST 2023


On Wed, Jan 11, 2023 at 1:08 PM Madison Kelly <mkelly at alteeve.com> wrote:
>
> On 2023-01-11 15:55, Reid Wahl wrote:
> > On Wed, Jan 11, 2023 at 12:48 PM Madison Kelly <mkelly at alteeve.com> wrote:
> >>
> >> On 2023-01-11 01:59, Reid Wahl wrote:
> >>> On Tue, Jan 10, 2023 at 10:14 PM Vladislav Bogdanov
> >>> <bubble at hoster-ok.com> wrote:
> >>>>
> >>>> I suspect that the validate action is run as a non-root user.
> >>>
> >>> As far as I know, both the direct command and crm_resource **should**
> >>> be running the agent as the same user, as long as Madison is running
> >>> both commands as the same user.
> >>>
> >>> For what it's worth, I copied your test script to my machine (Fedora
> >>> 36 using the current upstream main of Pacemaker) and it worked fine
> >>> both directly and via crm_resource. At the moment I'm not able to dig
> >>> very deeply, but I do wonder if it's either a bug that's since been
> >>> fixed, or perhaps an environment issue.
> >>>
> >>> To try to rule out the former, do you have a test environment where
> >>> you can try to reproduce it on the latest Pacemaker from upstream?
> >>
> >> I built the pacemaker source RPM from Fedora 37, then realized I'm
> >> already running 2.1.5 on CS8, so I'm already on the latest release.
> >> Looking at git, 2.1.5 is the latest tagged release... Are you running
> >> newer than that?
> >
> > I'm running on the current main, which contains commits that came
> > after the 2.1.5 release. I don't really expect this to be a Pacemaker
> > bug, especially with how recent your version is, but I would like to
> > rule that out if possible.
>
> Would you have either the src.rpm or the ./configure options you used
> on hand?

Running `make -C rpm rpm` like Ken said is probably the easiest way. I
normally build via `./autogen.sh && ./configure && make && sudo make
install`, but when you install from an RPM, the package manager takes
care of cleanup for you.
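
Since an environment issue is still on the table, one way to narrow it
down would be to record what the agent process looks like in each case
(direct call vs. crm_resource) and compare the two. Below is a rough
sketch of that idea in Python. I don't know what language your agent is
written in, so treat this purely as an illustration: the function names
(snapshot, write_snapshot, diff_snapshots) and the choice of fields are
my own assumptions, not anything Pacemaker provides.

```python
import json
import os


def snapshot():
    """Collect process details that can differ between invocation paths."""
    return {
        "uid": os.getuid(),
        "euid": os.geteuid(),
        "cwd": os.getcwd(),
        # os.isatty() returns False (rather than raising) if fd 0 is closed.
        "stdin_is_tty": os.isatty(0),
        "env": dict(os.environ),
    }


def write_snapshot(path):
    """Write the current snapshot to 'path' as JSON (hypothetical helper)."""
    with open(path, "w") as handle:
        json.dump(snapshot(), handle, indent=2, sort_keys=True)


def diff_snapshots(a, b):
    """Return {key: (a_value, b_value)} for every field or env var that differs."""
    changes = {}
    for key in sorted(set(a) | set(b)):
        if key != "env" and a.get(key) != b.get(key):
            changes[key] = (a.get(key), b.get(key))
    env_a, env_b = a.get("env", {}), b.get("env", {})
    for name in sorted(set(env_a) | set(env_b)):
        if env_a.get(name) != env_b.get(name):
            changes["env." + name] = (env_a.get(name), env_b.get(name))
    return changes
```

The idea would be to call write_snapshot() with a different file name
in each run, load both files with json.load(), and pass them to
diff_snapshots(). Anything that shows up there (a different user, a
missing variable, stdin not being a terminal) is a candidate to test
by hand.
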
>
> >>>> On January 11, 2023 at 07:06:55, Madison Kelly <mkelly at alteeve.com> wrote:
> >>>>
> >>>>> On 2023-01-11 00:21, Madison Kelly wrote:
> >>>>>>
> >>>>>> On 2023-01-11 00:14, Madison Kelly wrote:
> >>>>>>>
> >>>>>>> Hi all,
> >>>>>>>
> >>>>>>> Edit: Last message was in HTML format, sorry about that.
> >>>>>>>
> >>>>>>>      I've got a hell of a weird problem, and I am absolutely stumped on
> >>>>>>> what's going on.
> >>>>>>>
> >>>>>>>      The short of it is: if my RA is called from the command line, it's
> >>>>>>> fine. If a resource exists, monitor, enable, disable, all that stuff
> >>>>>>> works just fine. If I try to create a resource, it hangs on the
> >>>>>>> validate stage. Specifically, it hangs when 'pcs' calls:
> >>>>>>>
> >>>>>>> crm_resource --validate --output-as xml --class ocf --agent server
> >>>>>>> --provider alteeve --option name=<resource_name>
> >>>>>>>
> >>>>>>>      More precisely, it hangs when the RA tries to make a shell call (to
> >>>>>>> virsh, as it happens, but that doesn't matter). So to debug, I kept
> >>>>>>> stripping my RA down until I was left with the most basic of
> >>>>>>> programs:
> >>>>>>>
> >>>>>>> https://pastebin.com/VtSpkwMr
> >>>>>>>
> >>>>>>>      That is literally the simplest program I could write that made the
> >>>>>>> shell call. The 'open()' call is where it hangs.
> >>>>>>>
> >>>>>>> When I call it directly:
> >>>>>>>
> >>>>>>> time /usr/lib/ocf/resource.d/alteeve/server --validate-all --server
> >>>>>>> srv04-test; echo rc:$?
> >>>>>>>
> >>>>>>> ====
> >>>>>>> real    0m0.061s
> >>>>>>> user    0m0.037s
> >>>>>>> sys    0m0.014s
> >>>>>>> rc:0
> >>>>>>> ====
> >>>>>>>
> >>>>>>> It's just fine. I can see in the log the output from the 'virsh' call
> >>>>>>> as well. However, when I call it via crm_resource:
> >>>>>>>
> >>>>>>> time crm_resource --validate --output-as xml --class ocf --agent
> >>>>>>> server --provider alteeve --option name=srv04-test; echo rc:$?
> >>>>>>>
> >>>>>>> ====
> >>>>>>> <pacemaker-result api-version="2.25" request="crm_resource --validate
> >>>>>>> --output-as xml --class ocf --agent server --provider alteeve --option
> >>>>>>> name=srv04-test">
> >>>>>>>      <resource-agent-action action="validate" class="ocf" type="server"
> >>>>>>> provider="alteeve">
> >>>>>>>        <overrides/>
> >>>>>>>        <agent-status code="1" message="error" execution_code="2"
> >>>>>>> execution_message="Timed Out" reason="Resource agent did not exit
> >>>>>>> within specified timeout"/>
> >>>>>>>      </resource-agent-action>
> >>>>>>>      <status code="1" message="Error occurred">
> >>>>>>>        <errors>
> >>>>>>>          <error>crm_resource: Error performing operation: Error
> >>>>>>> occurred</error>
> >>>>>>>        </errors>
> >>>>>>>      </status>
> >>>>>>> </pacemaker-result>
> >>>>>>>
> >>>>>>> real    0m20.521s
> >>>>>>> user    0m0.022s
> >>>>>>> sys    0m0.010s
> >>>>>>> rc:1
> >>>>>>> ====
> >>>>>>>
> >>>>>>> In the log file, I see (from line 20 of the super-simple-test-script):
> >>>>>>>
> >>>>>>> ====
> >>>>>>> Calling: [/usr/bin/virsh dumpxml --inactive srv04-test 2>&1;
> >>>>>>> /usr/bin/echo return_code:0 |]
> >>>>>>> ====
> >>>>>>>
> >>>>>>> Then nothing else.
> >>>>>>>
> >>>>>>> The strace output is: https://pastebin.com/raw/UCEUdBeP
> >>>>>>>
> >>>>>>> Environment:
> >>>>>>>
> >>>>>>> * selinux is permissive
> >>>>>>> * Pacemaker 2.1.5-4.el8
> >>>>>>> * pcs 0.10.15
> >>>>>>> * 4.18.0-408.el8.x86_64
> >>>>>>> * CentOS Stream release 8
> >>>>>>>
> >>>>>>> Any help is appreciated, I am stumped. :/
> >>>>>>
> >>>>>>
> >>>>>> After sending this, I tried having my "RA" call 'hostname', and that
> >>>>>> worked fine. I switched back to 'virsh list --all', and that hangs. So
> >>>>>> it seems to somehow be related to calling 'virsh' specifically.
> >>>>>>
> >>>>>
> >>>>> OK, so more info... Knowing now that it's a problem with the virsh call
> >>>>> specifically (but only when validating; monitor, enable, and disable
> >>>>> all work fine for existing VMs, and all of them repeatedly call virsh),
> >>>>> I noticed a few things.
> >>>>>
> >>>>> First, I see in the logs:
> >>>>>
> >>>>> ====
> >>>>> Jan 11 00:30:43 mk-a07n02.digimer.ca libvirtd[2937]: Cannot recv data:
> >>>>> Connection reset by peer
> >>>>> ====
> >>>>>
> >>>>> So with this, I further simplified my test script to this:
> >>>>>
> >>>>> https://pastebin.com/Ey8FdL1t
> >>>>>
> >>>>> Then when I ran my test script directly, the strace output is:
> >>>>>
> >>>>> Good: https://pastebin.com/Trbq67ub
> >>>>>
> >>>>> When my script is called via crm_resource, the strace is this:
> >>>>>
> >>>>> Bad: https://pastebin.com/jtbzHrUM
> >>>>>
> >>>>> The first difference I can see is around line 929 in the good paste:
> >>>>> the line "futex(0x7f48b0001ca0, FUTEX_WAKE_PRIVATE, 1) = 0" appears
> >>>>> there but not in the bad paste. Shortly after, I start seeing:
> >>>>>
> >>>>> ====
> >>>>> line: [write(4, "\1\0\0\0\0\0\0\0", 8)         = 8]
> >>>>> line: [brk(NULL)                               = 0x562b7877d000]
> >>>>> line: [brk(0x562b787aa000)                     = 0x562b787aa000]
> >>>>> line: [write(4, "\1\0\0\0\0\0\0\0", 8)         = 8]
> >>>>> ====
> >>>>>
> >>>>> That is around line 959 in the bad paste. There are more brk() lines,
> >>>>> and not long after, the output stops.
> >>>>>
> >>>>> --
> >>>>> Madison Kelly
> >>>>> Alteeve's Niche!
> >>>>> Chief Technical Officer
> >>>>> c: +1-647-471-0951
> >>>>> https://alteeve.com/
> >>>>>
> >>>>> _______________________________________________
> >>>>> Manage your subscription:
> >>>>> https://lists.clusterlabs.org/mailman/listinfo/users
> >>>>>
> >>>>> ClusterLabs home: https://www.clusterlabs.org/


-- 
Regards,

Reid Wahl (He/Him)
Senior Software Engineer, Red Hat
RHEL High Availability - Pacemaker
