[ClusterLabs] pending actions

Fri Mar 24 22:33:21 UTC 2017

On 03/07/2017 04:13 PM, Jehan-Guillaume de Rorthais wrote:
> Hi,
> 
> Occasionally, I find my cluster with one pending action that is not executed
> for some minutes (I guess until the "PEngine Recheck Timer" elapses).
> 
> Running "crm_simulate -SL" shows the pending actions.
> 
> I'm still confused about how it can happen, why it happens, and how to avoid
> it.

It's most likely a bug in the crmd, which schedules policy engine (PE) runs.

> Earlier today, I started my test cluster with 3 nodes and a master/slave
> resource[1], all with positive master scores (1001, 1000 and 990), and the
> cluster kept the promote action as a pending action for 15 minutes.
> 
> Attached are the first 3 pengine inputs executed after the cluster
> startup.
> 
> What are the consequences if I set cluster-recheck-interval to 30s, for
> instance?

The cluster would consume more CPU and I/O continually recalculating the
cluster state.

It would be nice to have some guidelines for cluster-recheck-interval
based on real-world usage, but it's just going by gut feeling at this
point. The cluster automatically recalculates when something
"interesting" happens -- a node comes or goes, a monitor fails, a node
attribute changes, etc. The cluster-recheck-interval is (1) a failsafe
for buggy situations like this, and (2) the finest granularity available
to many time-based checks such as rules, which are otherwise only
re-evaluated when something else triggers a recalculation. I would
personally use at least 5 minutes, though less is probably reasonable,
especially with simple configurations (few nodes, resources and
constraints).
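
As a rough sketch of how the value could be inspected and changed, the
helpers below wrap crm_attribute, which ships with Pacemaker. The function
names are hypothetical, and rechecks_per_day is only an upper bound on
timer-driven recalculations, assuming nothing else triggers one in between.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; crm_attribute ships with Pacemaker.

# Query cluster-recheck-interval from the cluster options
# (this may report an error if the option was never set explicitly).
get_recheck_interval() {
    crm_attribute --type crm_config --name cluster-recheck-interval --query
}

# Set cluster-recheck-interval to the given value, e.g. "5min" or "30s".
set_recheck_interval() {
    crm_attribute --type crm_config --name cluster-recheck-interval --update "$1"
}

# Upper bound on timer-driven recalculations per day for an interval in seconds.
rechecks_per_day() {
    echo $(( 86400 / $1 ))
}
```

For example, "rechecks_per_day 30" prints 2880, against 288 for 300 seconds
and 96 for 900 seconds (the 15 minutes observed above), and
"set_recheck_interval 5min" would apply the 5-minute value.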

> Thanks in advance for any insight :)
> 
> Regards,
> 
> [1] here is the setup:
> http://dalibo.github.io/PAF/Quick_Start-CentOS-7.html#cluster-resource-creation-and-management

Feel free to open a bug report and include some logs around the time of
the incident (most importantly from the DC).
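
One possible way to gather those logs is crm_report, which ships with
Pacemaker and collects logs and cluster state for a time window. The wrapper
below is only a sketch: the function name is hypothetical, and the time
window and destination have to be chosen to match the actual incident.

```shell
#!/bin/sh
# Sketch only: the function name is hypothetical; crm_report ships with Pacemaker.

# Collect logs and cluster state between two timestamps into a report archive.
#   $1 = start time, $2 = end time ("YYYY-MM-DD HH:MM:SS"), $3 = destination name
collect_incident_report() {
    crm_report --from "$1" --to "$2" "$3"
}
```

It would be called as
collect_incident_report "YYYY-MM-DD HH:MM:SS" "YYYY-MM-DD HH:MM:SS" DEST
with a window that brackets the pending action.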