[Pacemaker] clear failcount when monitor is successful?

Lars Marowsky-Bree lmb at suse.com
Wed Apr 24 11:24:51 UTC 2013


On 2013-04-24T10:37:24, Johan Huysmans <johan.huysmans at inuits.be> wrote:

> --> start situation
> * scope=status  name=fail-count-d_tomcat value=0
> * depending resource group running on node
> * crm_mon shows everything ok
> 
> --> a failure occurs
> * scope=status  name=fail-count-d_tomcat value=1
> * depending resource group stopping on node
> * crm_mon shows failure
> 
> --> After 30s (= failure-timeout)
> * scope=status  name=fail-count-d_tomcat value=1
> * depending resource group not running on node
> * crm_mon shows NO failure !!!!!

This, by itself, is not necessarily surprising. The property
"cluster-recheck-interval" defines how often the PE gets re-run, and
defaults to 15 minutes.

This is not dynamically adjusted based on failure-timeouts, and if this
feature becomes more widely used, there probably should be a "better"
way to handle/trigger these while still avoiding swamping the cluster
with empty transitions etc.

In short: right now, if you want a failure-timeout of 30s to be
meaningful, you need to set cluster-recheck-interval to something
shorter.
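
To make the arithmetic concrete, here is a rough sketch of the timing. It
is a simplified model and not Pacemaker code: the function name
effective_clear_delay is made up for illustration, and it assumes that the
recheck timer starts counting from the transition triggered by the failure
and that nothing else in the cluster triggers a PE run in the meantime.

```python
def effective_clear_delay(failure_timeout_s: int, recheck_interval_s: int) -> int:
    """Seconds after a failure until the first timer-driven PE run that can
    see the failure as expired.

    Simplified model (assumption): the PE only re-runs every
    recheck_interval_s seconds, counted from the failure itself.
    """
    if failure_timeout_s < 0 or recheck_interval_s <= 0:
        raise ValueError("timeout must be >= 0 and interval must be > 0")
    # Number of whole recheck periods needed to reach or pass the expiry
    # (ceiling division), and at least one period must elapse.
    periods = max(1, -(-failure_timeout_s // recheck_interval_s))
    return periods * recheck_interval_s


if __name__ == "__main__":
    # failure-timeout of 30s against three recheck intervals (in seconds)
    for recheck in (900, 20, 10):
        print(recheck, effective_clear_delay(30, recheck))
```

Under these assumptions, a 30s failure-timeout with the default 15 minute
interval is only acted on after 900s, with a 20s interval after 40s, and
with a 10s interval after 30s.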

> --> After something changes in the cluster or the recheck interval
> * scope=status  name=fail-count-d_tomcat value=0
> * depending resource group can run on node
> * crm_mon shows no failure
> * BUT my resource is still monitored and failing!

I'm not sure I perfectly get what you're saying here with the last
sentence. Did the cluster try to restart it, and it failed again, yet
the failure was ignored this time around?

> I find it disturbing that a resource with a failing monitor has a 0
> failcount, shows ok in crm_mon and allows to run the depending
> resources.

Yes, if I got that right, that would be a problem - please create a
hb_/crm_report and open a bug.



Regards,
    Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde




