[ClusterLabs Developers] strange migration-threshold overflow, and fail-count update aborting its own recovery transition

Ken Gaillot kgaillot at redhat.com
Fri Apr 5 14:44:36 UTC 2019


On Fri, 2019-04-05 at 15:50 +0200, Lars Ellenberg wrote:
> As mentioned in #clusterlabs,
> but I think I post it here, so it won't get lost:
> 
> pacemaker 1.1.19, in case that matters.
> 
> "all good".
> 
> provoking a resource monitoring failure
> (manual umount of some file system)
> 
> monitoring failure triggers pengine run,
> (input here is no fail-count in status section yet,
> but failed monitoring operation)
> results in "Recovery" transition.
> 
> 
> which is then aborted by the fail-count=1 update of the very failure
> that this recovery transition is about.
> 
> Meanwhile, the "stop" operation was already scheduled,
> and results in "OK", so the second pengine run
> now has as input a fail-count=1, and a stopped resource.
> 
> The second pengine run would usually come to the same result,
> minus already completed actions, and no-one would notice.
> I assume it has been like that for a long time?

Yep, that's how it's always worked.

> But in this case, someone tried to be smart
> and set a migration-threshold of "very large",
> in this case the string in xml was: 999999999999, 
> and that probably is "parsed" into some negative value,

Anything above "INFINITY" (actually 1,000,000) should be mapped to
INFINITY. If that's not what happens, there's a bug. Running
crm_simulate in verbose mode should be helpful.

> which means the fail-count=1 now results in "forcing away ...",
> different resource placements,
> and the file system placement elsewhere now results in much more
> actions, demoting/role changes/movement of other dependent resources
> ...
> 
> 
> So I think we have two issues here:
> 
> a) I think the fail-count update should be visible as input in the
> cib
>    before the pengine calculates the recovery transition.

Agreed, but that's a nontrivial change to the current design.

Currently, once the controller detects a failure, it sends a request to
the attribute manager to update the fail-count attribute, then
immediately sends another request to update the last-failure time. The
attribute manager writes these out to the CIB as it gets them, the CIB
notifies the controller as each write completes, and the controller
calls the scheduler when it receives each notification.

The ideal solution would be to have the controller send a single
request to the attrd with both attributes, and have attrd write the
values together. However, the attribute manager's current client/server
protocol and inner workings are designed for one attribute at a time.
Writes can be grouped opportunistically, or when dampening is
explicitly configured, but the main approach is to write each change as
it is received.

There would also need to be some special handling for mixed-version
clusters (i.e. keeping it working during a rolling upgrade when some
nodes have the new capability and some don't).

It's certainly doable, but it's a medium-sized project. If there are
volunteers, I can give some pointers :) but there's quite a backlog of
projects at the moment.

> b) migration-threshold (and possibly other scores) should be properly
>    parsed/converted/capped/scaled/rejected

That should already be happening.

> What do you think?
> 
> "someone" probably finds the relevant lines of code faster than I do
> ;-)
> 
> Cheers,
> 
>     Lars
-- 
Ken Gaillot <kgaillot at redhat.com>



