[ClusterLabs] Pacemaker not issuing start command intermittently

Strahil Nikolov hunter86_bg at yahoo.com
Fri May 28 18:05:46 EDT 2021

Most RA scripts are written in bash. Usually you can change the shebang to '#!/usr/bin/bash -x', or you can set trace_ra=1 via 'pcs resource update RESOURCE trace_ra=1 trace_file=/somepath'.
If you don't define trace_file, the trace files should be created under /var/lib/heartbeat/trace_ra (that path is from memory, so use find/locate to confirm).
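
A minimal sketch of the tracing workflow described above, assuming GNU find and a host where pcs is installed. The function names are illustrative only, and the default directory is the from-memory path mentioned above, so verify it on your system.

```shell
#!/bin/bash
# Hypothetical helper names; the resource name is passed as the first argument.

# Turn on tracing for one resource (needs pcs and a running cluster).
enable_trace() {
    pcs resource update "$1" trace_ra=1
}

# Turn tracing off again; pcs treats an empty value as "remove this parameter".
disable_trace() {
    pcs resource update "$1" trace_ra=
}

# List trace files under a directory, newest first.
# The default path is an assumption; pass another directory if yours differs.
list_traces() {
    local dir="${1:-/var/lib/heartbeat/trace_ra}"
    find "$dir" -type f -printf '%T@ %p\n' 2>/dev/null | sort -rn | cut -d' ' -f2-
}
```

After reproducing the failure, 'list_traces | head' shows the most recent trace files, which should contain the 'set -x' style output of each traced agent invocation.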
Best Regards,
Strahil Nikolov

On Fri, May 28, 2021 at 22:10, Abithan Kumarasamy <Abithan.Kumarasamy at ibm.com> wrote:

Hello Team,

We have been recently running some tests on our Pacemaker clusters that involve two Pacemaker resources on two nodes respectively. The test case in which we are experiencing intermittent problems is one in which we bring down the Pacemaker resources on both nodes simultaneously. Now our expected behaviour is that our monitor function in our resource agent script detects the downtime, and then should issue a start command. This happens on most successful iterations of our test case.

However, on some iterations (approximately 1 out of 30 simulations) we notice that Pacemaker is issuing the start command on only one of the hosts. On the troubled host the monitor function is logging that the resource is down as expected and is exiting with OCF_ERR_GENERIC return code (1). According to the documentation, this should perform a soft disaster recovery, but when scanning the Pacemaker logs, there is no indication of the start command being issued or invoked. However, it works as expected on the other host.

To summarize the issue:
   - The resource’s monitor is running and returning OCF_ERR_GENERIC
   - The constraints we have for the resources are satisfied.
   - There are no visible differences in the Pacemaker logs between the test iteration that failed, and the multiple successful iterations, other than the fact that Pacemaker does not start the resource after the monitor returns OCF_ERR_GENERIC   
Could you provide some more insight into why this may be happening and how we can further debug this issue? We are currently relying on Pacemaker logs, but are there additional diagnostics to further debug?
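
For reference, the monitor behaviour described above can be sketched as follows. This is not the poster's agent: the pidfile check, the PIDFILE path and the function name are assumptions made for illustration. The return code values are the standard OCF ones; note that OCF also defines OCF_NOT_RUNNING (7) for a cleanly stopped resource, distinct from OCF_ERR_GENERIC (1).

```shell
#!/bin/sh
# Standard OCF return code values.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical pidfile; a real agent would derive this from its parameters.
PIDFILE="${PIDFILE:-/tmp/example-resource.pid}"

example_monitor() {
    # No pidfile at all: report the resource as cleanly stopped.
    [ -f "$PIDFILE" ] || return "$OCF_NOT_RUNNING"
    pid=$(cat "$PIDFILE")
    # kill -0 sends no signal; it only checks that the process exists.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        return "$OCF_SUCCESS"
    fi
    # Pidfile present but no live process: report a generic failure.
    return "$OCF_ERR_GENERIC"
}
```

Running the monitor by hand and printing its exit status ('example_monitor; echo $?') is a quick way to confirm which code the cluster actually receives.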