[ClusterLabs] clearing failed actions

Ken Gaillot kgaillot at redhat.com
Wed May 31 17:17:36 UTC 2017


On 05/30/2017 02:50 PM, Attila Megyeri wrote:
> Hi Ken,
> 
> 
>> -----Original Message-----
>> From: Ken Gaillot [mailto:kgaillot at redhat.com]
>> Sent: Tuesday, May 30, 2017 4:32 PM
>> To: users at clusterlabs.org
>> Subject: Re: [ClusterLabs] clearing failed actions
>>
>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
>>> Hi,
>>>
>>> Shouldn't the
>>>
>>> cluster-recheck-interval="2m"
>>>
>>> property instruct pacemaker to recheck the cluster every 2 minutes and
>>> clean the failcounts?
>>
>> It instructs pacemaker to recalculate whether any actions need to be
>> taken (including expiring any failcounts appropriately).
>>>
>>> At the primitive level I also have a
>>>
>>> migration-threshold="30" failure-timeout="2m"
>>>
>>> but whenever I have a failure, it remains there forever.
>>>
>>> What could be causing this?
>>>
>>> thanks,
>>>
>>> Attila
>> Is it a single old failure, or a recurring failure? The failure timeout
>> works in a somewhat nonintuitive way. Old failures are not individually
>> expired. Instead, all failures of a resource are simultaneously cleared
>> if all of them are older than the failure-timeout. So if something keeps
>> failing repeatedly (more frequently than the failure-timeout), none of
>> the failures will be cleared.
>>
>> If it's not a repeating failure, something odd is going on.
> 
> It is not a repeating failure. Let's say a resource fails for whatever
> action: it will remain in the failed actions list (crm_mon -Af) until I
> issue a "crm resource cleanup <resource name>", even after days or
> weeks, and even though I see in the logs that the cluster is rechecked
> every 120 seconds.
> 
> How could I troubleshoot this issue?
> 
> thanks!


Ah, I see what you're saying. That's expected behavior.

The failure-timeout applies to the failure *count* (which is used for
checking against migration-threshold), not the failure *history* (which
is used for the status display).
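You can see the difference for yourself. For example, with a
hypothetical resource "p_myrsc" on node "node1" (substitute your own
names; crm_failcount's option names vary a bit between Pacemaker
versions, so check --help on your build):

  # failure count -- this is what expires after failure-timeout
  crm_failcount -G -r p_myrsc -N node1

  # failure history -- the "Failed Actions" section stays until cleanup
  crm_mon -1Af

Once the failure-timeout has passed, the fail count should drop back to
0 (and the resource should disappear from crm_mon's "Migration
Summary"), but the old entry under "Failed Actions" remains.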

The idea is that an expired failure should no longer affect cluster
behavior, but an administrator can still see that it happened. That's
why a manual cleanup is required to clear the history.
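The cleanup you're already running is the intended way to do that.
Resource and node names below are just placeholders:

  # crmsh
  crm resource cleanup p_myrsc

  # low-level equivalent; -N limits the cleanup to a single node
  crm_resource --cleanup -r p_myrsc -N node1

Note that cleanup also resets the fail count, which matters if you're
counting on migration-threshold.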



