[ClusterLabs] cluster-recheck-interval and failure-timeout

Wed Mar 31 10:58:30 EDT 2021

On Wednesday 31 March 2021 at 15:48:15, Ken Gaillot wrote:

> On Wed, 2021-03-31 at 14:32 +0200, Antony Stone wrote:
> > 
> > So, what am I misunderstanding about "failure-timeout", and what
> > configuration setting do I need to use to tell pacemaker that "provided the
> > resource hasn't failed within the past X seconds, forget the fact that it
> > failed more than X seconds ago"?
> 
> Unfortunately, there is no way. failure-timeout expires *all* failures
> once the *most recent* is that old.

I've re-read the above sentence, and in fact you seem to be agreeing with my 
expectation (which is not what happens).

> It's a bit counter-intuitive but currently, Pacemaker only remembers a
> resource's most recent failure and the total count of failures, and changing
> that would be a big project.

I'm only interested in the most recent failure.  I'm saying that once that 
failure is more than "failure-timeout" seconds old, I want the fact that the 
resource failed to be forgotten, so that it can be restarted or moved between 
nodes as normal.  I don't want it to (a) be moved to another node just because 
there were two failures last Friday and then one today, or (b) get stuck and 
not run on any node at all because all three nodes had three failures 
sometime in the past month.
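
To make sure we are talking about the same thing, here is a rough sketch of the 
behaviour as I read your description: only a failure count and the time of the 
most recent failure are kept, and everything is forgotten together once that 
most recent failure is old enough.  This is not Pacemaker's code; the class and 
method names are made up for illustration, and the ">=" comparison is my 
assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureRecord:
    """Hypothetical model: a total count plus the time of the latest failure."""
    fail_count: int = 0
    last_failure: Optional[float] = None  # seconds, arbitrary clock

    def record_failure(self, now: float) -> None:
        # Earlier failure times are not kept, only the newest one.
        self.fail_count += 1
        self.last_failure = now

    def expire_if_old(self, now: float, failure_timeout: float) -> None:
        # All failures are forgotten at once, and only when the most
        # recent one is at least failure_timeout seconds old (assumed >=).
        if self.last_failure is not None and now - self.last_failure >= failure_timeout:
            self.fail_count = 0
            self.last_failure = None


record = FailureRecord()
record.record_failure(now=0)
record.record_failure(now=10)

record.expire_if_old(now=100, failure_timeout=180)
count_before_expiry = record.fail_count  # last failure is only 90s old

record.expire_if_old(now=190, failure_timeout=180)
count_after_expiry = record.fail_count   # last failure is now 180s old
```

If that model is right, the two old failures should no longer count against the 
resource by the time the later check runs, which is what I expected to see.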

Thanks,

Antony.

-- 
The Magic Words are Squeamish Ossifrage.

                                                   Please reply to the list;
                                                         please *don't* CC me.