[ClusterLabs] cluster-recheck-interval and failure-timeout
Antony Stone
Antony.Stone at ha.open.source.it
Wed Mar 31 10:58:30 EDT 2021
On Wednesday 31 March 2021 at 15:48:15, Ken Gaillot wrote:
> On Wed, 2021-03-31 at 14:32 +0200, Antony Stone wrote:
> >
> > So, what am I misunderstanding about "failure-timeout", and what
> > configuration setting do I need to use to tell pacemaker that "provided the
> > resource hasn't failed within the past X seconds, forget the fact that it
> > failed more than X seconds ago"?
>
> Unfortunately, there is no way. failure-timeout expires *all* failures
> once the *most recent* is that old.

I've re-read the above sentence, and in fact you seem to be agreeing with my
expectation (which is not what happens).

> It's a bit counter-intuitive but currently, Pacemaker only remembers a
> resource's most recent failure and the total count of failures, and changing
> that would be a big project.

I'm only interested in the most recent failure. I'm saying that once that
failure is more than "failure-timeout" seconds old, I want the fact that the
resource failed to be forgotten, so that it can be restarted or moved between
nodes as normal, rather than either (a) being moved to another node just
because there were two failures last Friday and then one today, or (b) getting
stuck and not running on any node at all because all three nodes had three
failures at some point in the past month.

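As a rough illustration of the behaviour Ken describes above (only the total
count and the time of the most recent failure are remembered, and everything
expires together once the most recent failure is older than failure-timeout),
here is a minimal Python sketch. It is not Pacemaker code: the class, its
fields and its method names are hypothetical, and it does not model when the
cluster actually gets around to re-evaluating expiry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureRecord:
    """Hypothetical model: keeps only a total count and the newest failure time."""
    fail_count: int = 0
    last_failure: Optional[float] = None  # seconds; None means no failure seen

    def record_failure(self, now: float) -> None:
        # Older timestamps are not kept; only the newest one survives.
        self.fail_count += 1
        self.last_failure = now

    def effective_count(self, now: float, failure_timeout: float) -> int:
        # All failures expire together, and only once the *most recent*
        # failure is at least failure_timeout seconds old.
        if self.last_failure is None:
            return 0
        if now - self.last_failure >= failure_timeout:
            return 0
        return self.fail_count


rec = FailureRecord()
rec.record_failure(now=0.0)
rec.record_failure(now=50.0)

# At t=70 the first failure is older than the 60 s timeout, but it still
# counts because the most recent failure is only 20 s old.
print(rec.effective_count(now=70.0, failure_timeout=60.0))

# At t=120 the most recent failure is 70 s old, so nothing counts any more.
print(rec.effective_count(now=120.0, failure_timeout=60.0))
```

Under this model an individual old failure can never expire on its own,
because its timestamp is no longer available once a newer failure has been
recorded.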
Thanks,
Antony.
--
The Magic Words are Squeamish Ossifrage.
Please reply to the list;
please *don't* CC me.