[ClusterLabs] clearing failed actions

Ken Gaillot kgaillot at redhat.com
Wed May 31 18:04:13 EDT 2017


On 05/31/2017 12:17 PM, Ken Gaillot wrote:
> On 05/30/2017 02:50 PM, Attila Megyeri wrote:
>> Hi Ken,
>>
>>
>>> -----Original Message-----
>>> From: Ken Gaillot [mailto:kgaillot at redhat.com]
>>> Sent: Tuesday, May 30, 2017 4:32 PM
>>> To: users at clusterlabs.org
>>> Subject: Re: [ClusterLabs] clearing failed actions
>>>
>>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
>>>> Hi,
>>>>
>>>> Shouldn't the
>>>>
>>>> cluster-recheck-interval="2m"
>>>>
>>>> property instruct pacemaker to recheck the cluster every 2 minutes and
>>>> clean the failcounts?
>>>
>>> It instructs pacemaker to recalculate whether any actions need to be
>>> taken (including expiring any failcounts appropriately).
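
For reference, that property is set cluster-wide; with crmsh it would look
something like the following (illustrative only; pcs or cibadmin can do the
same):

    # set the cluster-wide recheck interval to 2 minutes
    crm configure property cluster-recheck-interval=2m

    # verify the current value
    crm_attribute --type crm_config --name cluster-recheck-interval --query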
>>>
>>>> At the primitive level I also have a
>>>>
>>>> migration-threshold="30" failure-timeout="2m"
>>>>
>>>> but whenever I have a failure, it remains there forever.
>>>>
>>>> What could be causing this?
>>>>
>>>> thanks,
>>>>
>>>> Attila
>>> Is it a single old failure, or a recurring failure? The failure timeout
>>> works in a somewhat nonintuitive way. Old failures are not individually
>>> expired. Instead, all failures of a resource are simultaneously cleared
>>> if all of them are older than the failure-timeout. So if something keeps
>>> failing repeatedly (more frequently than the failure-timeout), none of
>>> the failures will be cleared.
>>>
>>> If it's not a repeating failure, something odd is going on.
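
As a sketch, a resource carrying the meta attributes mentioned above might
be defined roughly like this in crmsh syntax (the resource name and agent
here are placeholders, not the actual configuration):

    # hypothetical resource with the meta attributes discussed above
    crm configure primitive p_example ocf:pacemaker:Dummy \
        meta migration-threshold=30 failure-timeout=2m \
        op monitor interval=30s

With that, a single failure older than 2 minutes should expire at the next
cluster recheck, while failures repeating more often than every 2 minutes
never expire, as described above.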
>>
>> It is not a repeating failure. Let's say a resource fails for whatever action; it will remain in the failed actions (crm_mon -Af) until I issue a "crm resource cleanup <resource name>", even after days or weeks, and even though I see in the logs that the cluster is rechecked every 120 seconds.
>>
>> How could I troubleshoot this issue?
>>
>> thanks!
> 
> 
> Ah, I see what you're saying. That's expected behavior.
> 
> The failure-timeout applies to the failure *count* (which is used for
> checking against migration-threshold), not the failure *history* (which
> is used for the status display).
> 
> The idea is to have it no longer affect the cluster behavior, but still
> allow an administrator to know that it happened. That's why a manual
> cleanup is required to clear the history.

Hmm, I'm wrong there ... failure-timeout does expire the failure history
used for status display.
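
Either way, the count and the history can be inspected (and cleared)
separately, along these lines:

    # per-node fail counts (the values compared against migration-threshold),
    # in addition to the failed-actions history in the regular status output
    crm_mon -1 -f

    # a manual cleanup clears both for a given resource
    crm resource cleanup <resource name>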

It works in current versions. It's possible 1.1.10 had issues with that.

Check the status to see which node is DC, and look at the pacemaker log
there after the failure occurred. There should be a message about the
failcount expiring. You can also look at the live CIB and search for
last_failure to see what is used for the display.
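
Roughly (the log path and the exact message wording vary by version and
distribution):

    # the current DC is shown in the status output
    crm_mon -1 | grep "Current DC"

    # on the DC, look for expiry-related messages after the failure;
    # the log location depends on the installation, e.g.
    # /var/log/pacemaker.log or the corosync log file
    grep -i expir /var/log/pacemaker.log

    # the failure history used for the display lives in the CIB status
    # section as last_failure entries
    cibadmin --query | grep last_failure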


