[ClusterLabs] RA hangs when called by crm_resource (resending text format)

Madison Kelly mkelly at alteeve.com
Wed Jan 11 00:14:30 EST 2023


Hi all,

Edit: Last message was in HTML format, sorry about that.

   I've got a hell of a weird problem, and I am absolutely stumped on 
what's going on.

   The short of it is this: if my RA is called from the command line, it's 
fine. If a resource already exists, monitor, enable, disable, and all that 
stuff work just fine. If I try to create a resource, it hangs at the 
validate stage. Specifically, it hangs when 'pcs' calls:

crm_resource --validate --output-as xml --class ocf --agent server \
    --provider alteeve --option name=<resource_name>

   More precisely, it hangs when the RA tries to make a shell call (to 
virsh, in this case, but that doesn't matter). So to debug, I kept 
stripping my RA down until I was left with the most basic of programs:

https://pastebin.com/VtSpkwMr

   That is literally the simplest program I could write that made the 
shell call. The 'open()' call is where it hangs.

When I call the RA directly:

time /usr/lib/ocf/resource.d/alteeve/server --validate-all --server \
    srv04-test; echo rc:$?

====
real    0m0.061s
user    0m0.037s
sys    0m0.014s
rc:0
====

It's just fine. I can see the output from the 'virsh' call in the log as 
well. However, when I call it from crm_resource:

time crm_resource --validate --output-as xml --class ocf --agent server \
    --provider alteeve --option name=srv04-test; echo rc:$?

====
<pacemaker-result api-version="2.25" request="crm_resource --validate 
--output-as xml --class ocf --agent server --provider alteeve --option 
name=srv04-test">
   <resource-agent-action action="validate" class="ocf" type="server" 
provider="alteeve">
     <overrides/>
     <agent-status code="1" message="error" execution_code="2" 
execution_message="Timed Out" reason="Resource agent did not exit within 
specified timeout"/>
   </resource-agent-action>
   <status code="1" message="Error occurred">
     <errors>
       <error>crm_resource: Error performing operation: Error 
occurred</error>
     </errors>
   </status>
</pacemaker-result>

real    0m20.521s
user    0m0.022s
sys    0m0.010s
rc:1
====
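Since the same agent finishes in well under a second when run directly but times out under crm_resource, one way to narrow things down is to record the context the agent starts in and compare the two runs. The following is only a rough sketch, not part of the actual RA: the function name `snapshot_context` and the log path in the usage note are made up for illustration, and the signal-mask lines are only written on systems that expose /proc/self/status.

```shell
#!/bin/sh
# Hypothetical debugging helper (not from the original agent).
# Appends a snapshot of the calling context to the given log file so that a
# direct run and a crm_resource run can be compared side by side.
snapshot_context() {
    snapshot_log="$1"
    {
        echo "== $(date) pid=$$ ppid=$PPID"
        echo "-- environment"
        env | sort
        echo "-- stdin"
        if [ -t 0 ]; then
            echo "stdin is a terminal"
        else
            echo "stdin is not a terminal"
        fi
        # Blocked and ignored signal masks are inherited across exec, so the
        # values grep reports for itself reflect what the caller handed down.
        if [ -r /proc/self/status ]; then
            echo "-- signal masks"
            grep -E '^Sig(Blk|Ign)' /proc/self/status || true
        fi
    } >>"$snapshot_log"
}
```

Calling something like `snapshot_context /tmp/ra-context.log` at the top of the agent, then running it once directly and once through crm_resource, leaves two sections in the log that can be diffed by eye.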

In the log file, I see (from line 20 of the super-simple-test-script):

====
Calling: [/usr/bin/virsh dumpxml --inactive srv04-test 2>&1; 
/usr/bin/echo return_code:0 |]
====

Then nothing else.
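For readers without access to the pastebin, the call pattern in that log line (run the command with stderr merged into stdout, then append a return_code line) can be sketched in plain shell. This is an illustrative stand-in and not the RA's actual code: the function name `run_with_rc` is hypothetical, and the `timeout` wrapper from coreutils is an addition here so that a hang shows up as return code 124 instead of blocking forever.

```shell
#!/bin/sh
# Hypothetical helper mirroring the logged call pattern.
# Usage: run_with_rc <timeout_seconds> <command> [args...]
run_with_rc() {
    rc_limit="$1"
    shift
    rc=0
    # Merge stderr into stdout, as the "2>&1" in the logged call does.
    # timeout(1) exits with 124 if the command is still running at the limit.
    timeout "$rc_limit" "$@" 2>&1 || rc=$?
    # Append the exit status as a final line so the caller can parse it.
    echo "return_code:$rc"
}
```

For example, `run_with_rc 20 /usr/bin/virsh dumpxml --inactive srv04-test` would print whatever virsh writes, followed by a return_code line, with 124 indicating that the 20 second limit was reached.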

The strace output is: https://pastebin.com/raw/UCEUdBeP

Environment:

* selinux is permissive
* Pacemaker 2.1.5-4.el8
* pcs 0.10.15
* Kernel 4.18.0-408.el8.x86_64
* CentOS Stream release 8

Any help is appreciated, I am stumped. :/
-- 
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/

