[ClusterLabs] clearing failed actions

Tue May 30 10:32:03 EDT 2017

On 05/30/2017 09:13 AM, Attila Megyeri wrote:
> Hi,
> 
>  
> 
> Shouldn’t the 
> 
>  
> 
> cluster-recheck-interval="2m"
> 
>  
> 
> property instruct pacemaker to recheck the cluster every 2 minutes and
> clean the failcounts?

It instructs pacemaker to recalculate whether any actions need to be
taken (including expiring any failcounts appropriately).

> At the primitive level I also have a
> 
>  
> 
> migration-threshold="30" failure-timeout="2m"
> 
>  
> 
> but whenever I have a failure, it remains there forever.
> 
>  
> 
>  
> 
> What could be causing this?
> 
>  
> 
> thanks,
> 
> Attila
Is it a single old failure, or a recurring failure? The failure timeout
works in a somewhat nonintuitive way. Old failures are not individually
expired. Instead, all failures of a resource are simultaneously cleared
if all of them are older than the failure-timeout. So if something keeps
failing repeatedly (more frequently than the failure-timeout), none of
the failures will be cleared.

If it's not a repeating failure, something odd is going on.