Most RA scripts are writen in bash.<div id="yMail_cursorElementTracker_1622239249817">Usually you can change the shebang to '!#/usr/bin/bash -x' or you can set trace_ra=1 via 'pcs resource update RESOURCE trace_ra=1 trace_file=/somepath'.</div><div id="yMail_cursorElementTracker_1622239377845"><br></div><div id="yMail_cursorElementTracker_1622239378556">If you don't define trace_file, it should create them in /var/lib/heartbeat/trace_ra (based on memory -> so use find/locate).</div><div id="yMail_cursorElementTracker_1622239515230"><br></div><div id="yMail_cursorElementTracker_1622239515437">Best Regards,</div><div id="yMail_cursorElementTracker_1622239520757">Strahil Nikolov</div><div id="yMail_cursorElementTracker_1622239524008"><br></div><div id="yMail_cursorElementTracker_1622239513238"> <blockquote style="margin: 0 0 20px 0;"> <div style="font-family:Roboto, sans-serif; color:#6D00F6;"> <div>On Fri, May 28, 2021 at 22:10, Abithan Kumarasamy</div><div><Abithan.Kumarasamy@ibm.com> wrote:</div> </div> <div style="padding: 10px 0 0 20px; margin: 10px 0 0 0; border-left: 1px solid #6D00F6;"> <div id="yiv2500575883"><div class="yiv2500575883socmaildefaultfont" dir="ltr" style="font-family:Arial, Helvetica, sans-serif;font-size:10pt;"><div dir="ltr"><div style="font-size:medium;">Hello Team,</div>
<div style="font-size:medium;"> </div>
<div style="font-size:medium;">We have been recently running some tests on our Pacemaker clusters that involve two Pacemaker resources on two nodes respectively. The test case in which we are experiencing intermittent problems is one in which we bring down the Pacemaker resources on both nodes simultaneously. Now our expected behaviour is that our monitor function in our resource agent script detects the downtime, and then should issue a start command. This happens on most successful iterations of our test case. However, on some iterations (approximately 1 out of 30 simulations) we notice that Pacemaker is issuing the start command on only one of the hosts. On the troubled host the monitor function is logging that the resource is down as expected and is exiting with OCF_ERR_GENERIC return code (1) . According to the documentation, this should perform a soft disaster recovery, but when scanning the Pacemaker logs, there is no indication of the start command being issued or invoked. However, it works as expected on the other host. </div>
<div style="font-size:medium;"> </div>
<div style="font-size:medium;">To summarize the issue:</div>
<ol> <li><span style="font-size:12pt;">The resource’s monitor is running and returning OCF_ERR_GENERIC</span></li> <li><span style="font-size:12pt;">The constraints we have for the resources are satisfied.</span></li> <li><span style="font-size:12pt;">There are no visible differences in the Pacemaker logs between the test iteration that failed, and the multiple successful iterations, other than the fact that Pacemaker does not start the resource after the monitor returns OCF_ERR_GENERIC</span><br> </li></ol>
<div style="font-size:medium;">Could you provide some more insight into why this may be happening and how we can further debug this issue? We are currently relying on Pacemaker logs, but are there additional diagnostics to further debug?<br> </div>
<div style="font-size:medium;"> </div>
<div style="font-size:medium;">Thanks,</div>
<div style="font-size:medium;">Abithan</div></div></div><br>
</div>_______________________________________________<br>Manage your subscription:<br><a href="https://lists.clusterlabs.org/mailman/listinfo/users" target="_blank">https://lists.clusterlabs.org/mailman/listinfo/users</a><br><br>ClusterLabs home: <a href="https://www.clusterlabs.org/" target="_blank">https://www.clusterlabs.org/</a><br> </div> </blockquote></div>