[ClusterLabs] Pacemaker not issuing start command intermittently

Mon May 31 02:05:15 EDT 2021

On 5/29/21 12:05 AM, Strahil Nikolov wrote:
> Most RA scripts are written in bash.
> Usually you can change the shebang to '#!/usr/bin/bash -x' or you can 
> set trace_ra=1 via 'pcs resource update RESOURCE trace_ra=1 
> trace_file=/somepath'.
>
> If you don't define trace_file, it should create the trace files in 
> /var/lib/heartbeat/trace_ra (from memory, so use find/locate to confirm).
>
> Best Regards,
> Strahil Nikolov
>
>     On Fri, May 28, 2021 at 22:10, Abithan Kumarasamy
>     <Abithan.Kumarasamy at ibm.com> wrote:
>     Hello Team,
>     We have been recently running some tests on our Pacemaker clusters
>     that involve two Pacemaker resources on two nodes respectively.
>     The test case in which we are experiencing intermittent problems
>     is one in which we bring down the Pacemaker resources on both
>     nodes simultaneously. Now our expected behaviour is that our
>     monitor function in our resource agent script detects the
>     downtime, and then should issue a start command. This happens on
>     most successful iterations of our test case. However, on some
>     iterations (approximately 1 out of 30 simulations) we notice that
>     Pacemaker is issuing the start command on only one of the hosts.
>     On the troubled host the monitor function is logging that the
>     resource is down as expected and is exiting with the OCF_ERR_GENERIC
>     return code (1). According to the documentation, this should
>     perform a soft disaster recovery, but when scanning the Pacemaker
>     logs, there is no indication of the start command being issued or
>     invoked. However, it works as expected on the other host.
>     To summarize the issue:
>
>      1. The resource’s monitor is running and returning OCF_ERR_GENERIC.
>      2. The constraints we have for the resources are satisfied.
>      3. There are no visible differences in the Pacemaker logs between
>         the test iteration that failed and the multiple successful
>         iterations, other than the fact that Pacemaker does not start
>         the resource after the monitor returns OCF_ERR_GENERIC.
>
In general, Pacemaker won't directly start a resource after
receiving OCF_ERR_GENERIC from the monitor. As you already
mentioned, it will try to recover the resource to a known state
by first trying to stop it, and the state has to be reported as
stopped after that. Only then will it try to restart the
resource, if the rules say so.
Which resource agent are you using? If you brought down
the resource manually, the monitor shouldn't report
OCF_ERR_GENERIC but stopped (OCF_NOT_RUNNING).
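
As a rough sketch of that distinction (not taken from any particular
agent): a monitor action can report a cleanly stopped resource as
OCF_NOT_RUNNING and reserve OCF_ERR_GENERIC for a resource that is
running but unhealthy. The names my_monitor, PIDFILE and HEALTH_CMD
are hypothetical, and a real agent would normally source
ocf-shellfuncs rather than define the return codes itself.

```shell
#!/bin/sh
# Hypothetical monitor sketch; the return codes follow the OCF convention.
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_GENERIC:=1}"
: "${OCF_NOT_RUNNING:=7}"

PIDFILE="${PIDFILE:-/tmp/my_service.pid}"   # assumed pidfile location
HEALTH_CMD="${HEALTH_CMD:-true}"            # assumed health-check command

my_monitor() {
    # No pidfile: treat the resource as cleanly stopped.
    [ -f "$PIDFILE" ] || return "$OCF_NOT_RUNNING"

    pid=$(cat "$PIDFILE" 2>/dev/null)
    # Empty pidfile or no such process: also report "not running".
    if [ -z "$pid" ] || ! kill -0 "$pid" 2>/dev/null; then
        return "$OCF_NOT_RUNNING"
    fi

    # Process exists but the health check fails: this is the
    # failed state that warrants OCF_ERR_GENERIC.
    if ! sh -c "$HEALTH_CMD" >/dev/null 2>&1; then
        return "$OCF_ERR_GENERIC"
    fi

    return "$OCF_SUCCESS"
}
```

With a monitor shaped like this, bringing the resource down by hand
would be seen as "stopped" rather than as a failure.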

Regards,
Klaus
>
>     Could you provide some more insight into why this may be happening
>     and how we can further debug this issue? We are currently relying
>     on Pacemaker logs, but are there additional diagnostics to further
>     debug?
>     Thanks,
>     Abithan
>
>     _______________________________________________
>     Manage your subscription:
>     https://lists.clusterlabs.org/mailman/listinfo/users
>
>     ClusterLabs home: https://www.clusterlabs.org/
