<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<body>
<div dir="auto">
<div dir="auto">Then I would suggest logging all environment variables in both cases and comparing them; probably something virsh needs is missing from the environment during validate.</div><div dir='auto'><br></div>
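<div dir="auto">A minimal sketch of that suggestion, written in Python for illustration only (the actual agent is not shown in this thread, and the helper names dump_env, load_env and diff_env are hypothetical): have the agent dump its environment to a file on every run, then compare the dump from a direct call with the dump from the crm_resource call.</div>

```python
import json
import os


def dump_env(path):
    """Write this process's environment to `path` as sorted JSON."""
    with open(path, "w") as fh:
        json.dump(dict(os.environ), fh, indent=2, sort_keys=True)


def load_env(path):
    """Read an environment dump written by dump_env()."""
    with open(path) as fh:
        return json.load(fh)


def diff_env(good, bad):
    """Compare two environment dicts.

    Returns (missing, extra, changed):
      missing - names set in `good` but not in `bad`
      extra   - names set in `bad` but not in `good`
      changed - {name: (good_value, bad_value)} for names set in both
                whose values differ
    """
    missing = sorted(set(good).difference(bad))
    extra = sorted(set(bad).difference(good))
    changed = {
        name: (good[name], bad[name])
        for name in sorted(set(good).intersection(bad))
        if good[name] != bad[name]
    }
    return missing, extra, changed
```

<div dir="auto">For example, the agent could call dump_env() with a file name that includes its PID (a hypothetical naming scheme), and diff_env(load_env(direct_dump), load_env(crm_dump)) would then list the variables the crm_resource run lacks or sets differently.</div>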
<div id="aqm-original" style="color: black;">
<div dir="auto">Madison Kelly <mkelly@alteeve.com> wrote on 11 January 2023 at 22:06:45:</div>
<div><br></div>
<blockquote type="cite" class="gmail_quote" style="margin: 0 0 0 0.75ex; border-left: 1px solid #808080; padding-left: 0.75ex;">
<div dir="auto">On 2023-01-11 01:13, Vladislav Bogdanov wrote:</div>
<blockquote type="cite" class="gmail_quote" style="margin: 0 0 0 0.75ex; border-left: 1px solid #0099CC; padding-left: 0.75ex;">
<div dir="auto">I suspect that the validate action is run as a non-root user.</div>
</blockquote>
<div dir="auto"><br></div>
<div dir="auto">I modified the script to log the real and effective UIDs and it's </div>
<div dir="auto">running as root in both instances.</div>
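<div dir="auto">A minimal Python equivalent of that check, for readers who want to reproduce it (the agent itself is not shown here, and log_ids is a hypothetical helper). os.getuid() returns the real UID and os.geteuid() the effective UID; both are 0 when the process runs as root.</div>

```python
import os
import sys


def log_ids(stream=sys.stderr):
    """Write the real and effective UIDs of this process to `stream`."""
    ruid, euid = os.getuid(), os.geteuid()
    stream.write(f"real_uid={ruid} effective_uid={euid}\n")
    return ruid, euid
```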
<div dir="auto"><br></div>
<blockquote type="cite" class="gmail_quote" style="margin: 0 0 0 0.75ex; border-left: 1px solid #0099CC; padding-left: 0.75ex;">
<div dir="auto">Madison Kelly <mkelly@alteeve.com> wrote on 11 January 2023 at 07:06:55:</div>
<div dir="auto"><br></div>
<blockquote type="cite" class="gmail_quote" style="margin: 0 0 0 0.75ex; border-left: 1px solid #9933CC; padding-left: 0.75ex;">
<div dir="auto">On 2023-01-11 00:21, Madison Kelly wrote:</div>
<blockquote type="cite" class="gmail_quote" style="margin: 0 0 0 0.75ex; border-left: 1px solid #669900; padding-left: 0.75ex;">
<div dir="auto">On 2023-01-11 00:14, Madison Kelly wrote:</div>
<blockquote type="cite" class="gmail_quote" style="margin: 0 0 0 0.75ex; border-left: 1px solid #FF8800; padding-left: 0.75ex;">
<div dir="auto">Hi all,</div>
<div dir="auto"><br></div>
<div dir="auto">Edit: Last message was in HTML format, sorry about that.</div>
<div dir="auto"><br></div>
<div dir="auto"> I've got a hell of a weird problem, and I am absolutely stumped on</div>
<div dir="auto">what's going on.</div>
<div dir="auto"><br></div>
<div dir="auto">The short of it is: if my RA is called from the command line, it's fine. If a resource already exists, monitor, enable, disable, and all the other actions work just fine. If I try to create a resource, it hangs at the validate stage. Specifically, it hangs when 'pcs' calls:</div>
<div dir="auto"><br></div>
<div dir="auto">crm_resource --validate --output-as xml --class ocf --agent server --provider alteeve --option name=<resource_name></div>
<div dir="auto"><br></div>
<div dir="auto">More precisely, it hangs when the agent makes a shell call (to virsh in this case, but that doesn't seem to matter). To debug, I kept stripping my RA down until I was left with the most basic program possible:</div>
<div dir="auto"><br></div>
<div dir="auto">https://pastebin.com/VtSpkwMr</div>
<div dir="auto"><br></div>
<div dir="auto"> That is literally the simplest program I could write that made the</div>
<div dir="auto">shell call. The 'open()' call is where it hangs.</div>
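<div dir="auto">The pasted program is not reproduced here. As a hedged alternative, this Python sketch (run_with_timeout is a hypothetical helper; redirecting stdin from /dev/null is a defensive assumption, not a known cause of the hang) runs the command without a shell, reads the exit status directly, and gives up after a fixed time. It does not explain the hang; it only lets the agent log what happened and fail cleanly instead of blocking until crm_resource gives up on it.</div>

```python
import subprocess


def run_with_timeout(argv, timeout):
    """Run `argv` without a shell; return (return_code, combined_output).

    If the command does not finish within `timeout` seconds, subprocess.run()
    kills it and this returns (None, whatever output was captured so far).
    """
    try:
        proc = subprocess.run(
            argv,
            stdin=subprocess.DEVNULL,   # do not inherit the caller's stdin
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,   # fold stderr into stdout
            timeout=timeout,
            text=True,
        )
    except subprocess.TimeoutExpired as exc:
        # Partial output on timeout is bytes even when text=True.
        partial = exc.output or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return None, partial
    return proc.returncode, proc.stdout
```

<div dir="auto">For example, run_with_timeout(["/usr/bin/virsh", "list", "--all"], timeout=10) returns (None, ...) if virsh hangs. The 10-second value is arbitrary; it only needs to be shorter than the timeout crm_resource applies to the validate action.</div>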
<div dir="auto"><br></div>
<div dir="auto">When I call it directly:</div>
<div dir="auto"><br></div>
<div dir="auto">time /usr/lib/ocf/resource.d/alteeve/server --validate-all --server srv04-test; echo rc:$?</div>
<div dir="auto"><br></div>
<div dir="auto">====</div>
<div dir="auto">real 0m0.061s</div>
<div dir="auto">user 0m0.037s</div>
<div dir="auto">sys 0m0.014s</div>
<div dir="auto">rc:0</div>
<div dir="auto">====</div>
<div dir="auto"><br></div>
<div dir="auto">It works fine, and I can see the output of the 'virsh' call in the log as well. However, when I call it via crm_resource:</div>
<div dir="auto"><br></div>
<div dir="auto">time crm_resource --validate --output-as xml --class ocf --agent server --provider alteeve --option name=srv04-test; echo rc:$?</div>
<div dir="auto"><br></div>
<div dir="auto">====</div>
<div dir="auto"><pacemaker-result api-version="2.25" request="crm_resource --validate --output-as xml --class ocf --agent server --provider alteeve --option name=srv04-test"></div>
<div dir="auto"> <resource-agent-action action="validate" class="ocf" type="server" provider="alteeve"></div>
<div dir="auto"> <overrides/></div>
<div dir="auto"> <agent-status code="1" message="error" execution_code="2" execution_message="Timed Out" reason="Resource agent did not exit within specified timeout"/></div>
<div dir="auto"> </resource-agent-action></div>
<div dir="auto"> <status code="1" message="Error occurred"></div>
<div dir="auto"> <errors></div>
<div dir="auto"> <error>crm_resource: Error performing operation: Error occurred</error></div>
<div dir="auto"> </errors></div>
<div dir="auto"> </status></div>
<div dir="auto"></pacemaker-result></div>
<div dir="auto"><br></div>
<div dir="auto">real 0m20.521s</div>
<div dir="auto">user 0m0.022s</div>
<div dir="auto">sys 0m0.010s</div>
<div dir="auto">rc:1</div>
<div dir="auto">====</div>
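<div dir="auto">When scripting repeated tests like this, the interesting fields can be pulled out of the XML result above with Python's standard xml.etree.ElementTree module. This is a sketch that assumes the element layout seen in this particular output (pacemaker-result, resource-agent-action, agent-status, status, errors); parse_validate_result is a hypothetical name, and the timed_out flag simply matches the "Timed Out" message text shown here.</div>

```python
import xml.etree.ElementTree as ET

AGENT_STATUS_FIELDS = ("code", "message", "execution_code",
                       "execution_message", "reason")


def parse_validate_result(xml_text):
    """Summarise 'crm_resource --validate --output-as xml' output.

    Returns the agent-status attributes (as strings, or None if absent),
    the overall status code, any error messages, and a timed_out flag.
    """
    root = ET.fromstring(xml_text)
    agent = root.find("./resource-agent-action/agent-status")
    status = root.find("./status")
    summary = {
        key: (agent.get(key) if agent is not None else None)
        for key in AGENT_STATUS_FIELDS
    }
    summary["status_code"] = status.get("code") if status is not None else None
    summary["errors"] = [e.text for e in root.findall("./status/errors/error")]
    # Based on the message text in the output above, not on a documented enum.
    summary["timed_out"] = summary["execution_message"] == "Timed Out"
    return summary
```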
<div dir="auto"><br></div>
<div dir="auto">In the log file, I see the following (from line 20 of the super-simple test script):</div>
<div dir="auto"><br></div>
<div dir="auto">====</div>
<div dir="auto">Calling: [/usr/bin/virsh dumpxml --inactive srv04-test 2>&1; /usr/bin/echo return_code:0 |]</div>
<div dir="auto">====</div>
<div dir="auto"><br></div>
<div dir="auto">Then nothing else is logged.</div>
<div dir="auto"><br></div>
<div dir="auto">The strace output is: https://pastebin.com/raw/UCEUdBeP</div>
<div dir="auto"><br></div>
<div dir="auto">Environment:</div>
<div dir="auto"><br></div>
<div dir="auto">* selinux is permissive</div>
<div dir="auto">* Pacemaker 2.1.5-4.el8</div>
<div dir="auto">* pcs 0.10.15</div>
<div dir="auto">* Kernel 4.18.0-408.el8.x86_64</div>
<div dir="auto">* CentOS Stream release 8</div>
<div dir="auto"><br></div>
<div dir="auto">Any help is appreciated, I am stumped. :/</div>
</blockquote>
<div dir="auto"><br></div>
<div dir="auto">After sending this, I tried having my "RA" call 'hostname', and that worked fine. I switched back to 'virsh list --all', and that hangs. So it seems to be related to calling 'virsh' specifically.</div>
<div dir="auto"><br></div>
</blockquote>
<div dir="auto"><br></div>
<div dir="auto">OK, so more info... Now that I know the problem is specific to the virsh call (and only when validating; existing VMs monitor, enable, and disable fine, and all of those actions call virsh repeatedly), I noticed a few things.</div>
<div dir="auto"><br></div>
<div dir="auto">First, I see in the logs:</div>
<div dir="auto"><br></div>
<div dir="auto">====</div>
<div dir="auto">Jan 11 00:30:43 mk-a07n02.digimer.ca libvirtd[2937]: Cannot recv data: Connection reset by peer</div>
<div dir="auto">====</div>
<div dir="auto"><br></div>
<div dir="auto">With this in mind, I simplified my test script further:</div>
<div dir="auto"><br></div>
<div dir="auto">https://pastebin.com/Ey8FdL1t</div>
<div dir="auto"><br></div>
<div dir="auto">When I run my test script directly, the strace output is:</div>
<div dir="auto"><br></div>
<div dir="auto">Good: https://pastebin.com/Trbq67ub</div>
<div dir="auto"><br></div>
<div dir="auto">When my script is called via crm_resource, the strace is this:</div>
<div dir="auto"><br></div>
<div dir="auto">Bad: https://pastebin.com/jtbzHrUM</div>
<div dir="auto"><br></div>
<div dir="auto">The first difference I can see is around line 929 of the good paste: the line "futex(0x7f48b0001ca0, FUTEX_WAKE_PRIVATE, 1) = 0" exists there but not in the bad paste. Shortly after, I start seeing:</div>
<div dir="auto"><br></div>
<div dir="auto">====</div>
<div dir="auto">line: [write(4, "\1\0\0\0\0\0\0\0", 8) = 8]</div>
<div dir="auto">line: [brk(NULL) = 0x562b7877d000]</div>
<div dir="auto">line: [brk(0x562b787aa000) = 0x562b787aa000]</div>
<div dir="auto">line: [write(4, "\1\0\0\0\0\0\0\0", 8) = 8]</div>
<div dir="auto">====</div>
<div dir="auto"><br></div>
<div dir="auto">That is around line 959 of the bad paste. There are more brk() lines, and not long after that the output stops.</div>
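<div dir="auto">Finding the first divergence by hand is tedious because pointer values and brk() addresses change from run to run. A rough Python sketch (the helper names and the masking rule are assumptions) replaces every hexadecimal value with a placeholder and uses the standard difflib module to report the first line, numbered from 1, where two traces differ.</div>

```python
import difflib
import re

# Pointers and brk() addresses differ between runs, so mask every hex value.
HEX_VALUE = re.compile(r"0x[0-9a-fA-F]+")


def normalise(line):
    """Strip the newline and replace hex values with a fixed placeholder."""
    return HEX_VALUE.sub("0xADDR", line.rstrip("\n"))


def first_difference(good_lines, bad_lines):
    """Return 1-based (good_line, bad_line) numbers where the normalised
    traces first diverge, or None if they are identical.

    A returned number may be one past the end of a trace that simply stops.
    """
    good = [normalise(line) for line in good_lines]
    bad = [normalise(line) for line in bad_lines]
    matcher = difflib.SequenceMatcher(None, good, bad, autojunk=False)
    for tag, i1, _i2, j1, _j2 in matcher.get_opcodes():
        if tag != "equal":
            return i1 + 1, j1 + 1
    return None
```

<div dir="auto">Usage, with hypothetical file names: open good.txt and bad.txt and pass the two file objects to first_difference(). Masking all hex values can also hide real differences in flags or buffer contents, so treat the result as a starting point for reading the traces.</div>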
<div dir="auto"><br></div>
<div dir="auto">-- </div>
<div dir="auto">Madison Kelly</div>
<div dir="auto">Alteeve's Niche!</div>
<div dir="auto">Chief Technical Officer</div>
<div dir="auto">c: +1-647-471-0951</div>
<div dir="auto">https://alteeve.com/</div>
<div dir="auto"><br></div>
<div dir="auto">_______________________________________________</div>
<div dir="auto">Manage your subscription:</div>
<div dir="auto">https://lists.clusterlabs.org/mailman/listinfo/users</div>
<div dir="auto"><br></div>
<div dir="auto">ClusterLabs home: https://www.clusterlabs.org/</div>
</blockquote>
<div dir="auto"><br></div>
</blockquote>
<div dir="auto"><br></div>
</blockquote>
</div><div dir="auto"><br></div>
</div></body>
</html>