[ClusterLabs Developers] strange migration-threshold overflow, and fail-count update aborting its own recovery transition

Lars Ellenberg lars.ellenberg at linbit.com
Fri Apr 5 09:50:46 EDT 2019


As mentioned in #clusterlabs,
but I think I'll post it here as well, so it won't get lost:

pacemaker 1.1.19, in case that matters.

"all good".

Provoking a resource monitoring failure
(manual umount of some file system).

The monitoring failure triggers a pengine run
(the input here has no fail-count in the status section yet,
only the failed monitoring operation),
which results in a "Recovery" transition.


That transition is then aborted by the fail-count=1 update for the very
failure that this recovery transition is about.

Meanwhile, the "stop" operation had already been scheduled
and returns "OK", so the second pengine run
now has as its input a fail-count=1 and a stopped resource.

The second pengine run would usually come to the same result,
minus the already completed actions, and no one would notice.
I assume it has been like that for a long time?

But in this case, someone tried to be smart
and set a "very large" migration-threshold;
the string in the XML was: 999999999999,
and that is probably "parsed" into some negative value,
which means the fail-count=1 now results in "forcing away ...",
different resource placements, and placing the file system elsewhere
now results in many more actions: demoting, role changes, and movement
of other dependent resources ...
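
Just to illustrate what I suspect is happening (this is not the actual
Pacemaker parsing code, only a minimal sketch): if a 64-bit parse of
that string gets narrowed into a 32-bit score, the result goes negative
on common platforms.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* the value from the XML */
    const char *threshold = "999999999999";

    /* parse into 64 bits, then narrow to a 32-bit score
     * (the narrowing is implementation-defined in C) */
    long long parsed = strtoll(threshold, NULL, 10);
    int score = (int) parsed;

    printf("parsed=%lld  narrowed=%d\n", parsed, score);
    /* on typical platforms with 32-bit int this prints:
     * parsed=999999999999  narrowed=-727379969 */
    return 0;
}

A negative migration-threshold would explain the "forcing away" above.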


So I think we have two issues here:

a) I think the fail-count update should be visible as input in the CIB
   before the pengine calculates the recovery transition.

b) migration-threshold (and possibly other scores) should be properly
   parsed/converted/capped/scaled/rejected (see the sketch after this list).
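
For (b), roughly the kind of clamping I have in mind (the function name
is made up and this is not the actual Pacemaker helper; 1000000 is used
as the cap to match Pacemaker's score INFINITY):

#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define SCORE_INFINITY 1000000

static int
parse_score_option(const char *text, int dflt)
{
    char *end = NULL;
    long long value;

    if (text == NULL) {
        return dflt;
    }
    /* accept the usual INFINITY spellings */
    if (strcmp(text, "INFINITY") == 0 || strcmp(text, "+INFINITY") == 0) {
        return SCORE_INFINITY;
    }
    if (strcmp(text, "-INFINITY") == 0) {
        return -SCORE_INFINITY;
    }

    errno = 0;
    value = strtoll(text, &end, 10);
    if (errno != 0 || end == text || *end != '\0') {
        return dflt;                    /* reject unparsable input */
    }
    if (value > SCORE_INFINITY) {
        return SCORE_INFINITY;          /* cap instead of overflowing */
    }
    if (value < -SCORE_INFINITY) {
        return -SCORE_INFINITY;
    }
    return (int) value;
}

With something like that, parse_score_option("999999999999", 0) comes
back as 1000000 instead of wrapping around to a negative score.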


What do you think?

"someone" probably finds the relevant lines of code faster than I do ;-)

Cheers,

    Lars


