[ClusterLabs Developers] strange migration-threshold overflow, and fail-count update aborting its own recovery transition

Ken Gaillot kgaillot at redhat.com
Fri Apr 5 10:56:51 EDT 2019


On Fri, 2019-04-05 at 09:44 -0500, Ken Gaillot wrote:
> On Fri, 2019-04-05 at 15:50 +0200, Lars Ellenberg wrote:
> > As mentioned in #clusterlabs,
> > but I think I post it here, so it won't get lost:
> > 
> > pacemaker 1.1.19, in case that matters.
> > 
> > "all good".
> > 
> > provoking resource monitoring failure
> > (manually umounting some file system)
> > 
> > monitoring failure triggers a pengine run
> > (input here has no fail-count in the status section yet,
> > just the failed monitoring operation)
> > and results in a "Recovery" transition.

Ah, I misunderstood ... I was thinking the two writes were the fail
count and the last failure time node attributes, but that's not it (and
probably already handled sufficiently).

The two writes are the recording of the operation status result,
followed by the recording of the fail count and last failure time node
attributes.

That's more challenging. The controller needs to see the status result
to determine whether it was a failure; at that point it sends the
fail-count change to the attribute manager, which writes it to the
CIB, and only then can the controller act on it.

A possible solution would be to skip running the scheduler for the
status result if we're sending a fail-count change. But that's risky,
because it assumes everything involved in the fail-count change will
succeed, and within a reasonable time. That's why the current approach
is to run the scheduler immediately and then re-run when the fail-count
change is confirmed.
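
To make that ordering concrete, here's a toy sketch in plain C (nothing
from the Pacemaker tree, all names invented) of why the same failure
ends up producing two scheduler runs:

    /*
     * Toy sketch (not Pacemaker code) of the ordering described above:
     * the scheduler runs once on the raw status result, and again once
     * the fail-count write to the CIB is confirmed.  All names here
     * are made up for illustration.
     */
    #include <stdio.h>
    #include <stdbool.h>

    static int cib_fail_count = 0;      /* what the scheduler can "see" */

    static void run_scheduler(const char *reason)
    {
        printf("scheduler run (%s): fail-count as seen in CIB = %d\n",
               reason, cib_fail_count);
    }

    int main(void)
    {
        bool monitor_failed = true;

        if (monitor_failed) {
            /* 1. Act on the status result right away ... */
            run_scheduler("status result recorded");   /* recovery scheduled */

            /* 2. ... while the fail-count update travels controller ->
             *    attribute manager -> CIB.  Only when that CIB write is
             *    confirmed does the new value become scheduler input.  */
            cib_fail_count = 1;                         /* write confirmed */

            /* 3. The confirmation aborts the in-flight transition and
             *    triggers a second scheduler run with the new input.   */
            run_scheduler("fail-count write confirmed");
        }
        return 0;
    }

In the real cluster, that second run is what Lars is seeing abort the
recovery transition that the first run scheduled.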

> > 
> > 
> > which is then aborted by the fail-count=1 update of the very failure
> > that this recovery transition is about.
> > 
> > Meanwhile, the "stop" operation was already scheduled,
> > and results in "OK", so the second pengine run
> > now has as input a fail-count=1, and a stopped resource.
> > 
> > The second pengine run would usually come to the same result,
> > minus already completed actions, and no-one would notice.
> > I assume it has been like that for a long time?
> 
> Yep, that's how it's always worked
> 
> > But in this case, someone tried to be smart
> > and set a migration-threshold of "very large",
> > in this case the string in xml was: 999999999999, 
> > and that probably is "parsed" into some negative value,
> 
> Anything above "INFINITY" (actually 1,000,000) should be mapped to
> INFINITY. If that's not what happens, there's a bug. Running
> crm_simulate in verbose mode should be helpful.
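
For illustration, the kind of clamping parse that would avoid the
wrap-around looks roughly like the sketch below (plain C, not the
actual Pacemaker parsing code; the only project-specific fact it uses
is that INFINITY is 1,000,000):

    /* Sketch of a clamped score parse.  Illustrative only, not the
     * Pacemaker implementation. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define SCORE_INFINITY 1000000

    static int parse_score(const char *s)
    {
        if (s == NULL) {
            return 0;
        }
        if (!strcmp(s, "INFINITY") || !strcmp(s, "+INFINITY")) {
            return SCORE_INFINITY;
        }
        if (!strcmp(s, "-INFINITY")) {
            return -SCORE_INFINITY;
        }

        /* Parse into a wide type first so a string like "999999999999"
         * gets clamped instead of wrapping into a negative int. */
        long long v = strtoll(s, NULL, 10);

        if (v > SCORE_INFINITY) {
            return SCORE_INFINITY;
        }
        if (v < -SCORE_INFINITY) {
            return -SCORE_INFINITY;
        }
        return (int) v;
    }

    int main(void)
    {
        /* The value from the report: should clamp to +INFINITY,
         * never go negative. */
        printf("%d\n", parse_score("999999999999"));   /* prints 1000000 */
        return 0;
    }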
> 
> > which means the fail-count=1 now results in "forcing away ...",
> > different resource placements,
> > and the file system placement elsewhere now results in many more
> > actions, demoting/role changes/movement of other dependent resources
> > ...
> > 
> > 
> > So I think we have two issues here:
> > 
> > a) I think the fail-count update should be visible as input in the
> >    cib before the pengine calculates the recovery transition.
> 
> Agreed, but that's a nontrivial change to the current design.
> 
> Currently, once the controller detects a failure, it sends a request to
> the attribute manager to update the fail-count attribute, then
> immediately sends another request to update the last-failure time. The
> attribute manager writes these out to the CIB as it gets them, the CIB
> notifies the controller as each write completes, and the controller
> calls the scheduler when it receives each notification.
> 
> The ideal solution would be to have the controller send a single
> request to attrd with both attributes, and have attrd write the
> values together. However, the attribute manager's current
> client/server protocol and inner workings are designed for one
> attribute at a time. Writes can be grouped opportunistically or when
> dampening is explicitly configured, but the main approach is to write
> each change as it is received.
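
Purely to illustrate the idea, a batched request from the controller
might carry something like the following; the struct, attribute names
and send function are invented for the example and are not the real
attrd IPC API:

    /* Hypothetical illustration only: what a batched update request
     * from the controller to the attribute manager might carry.  The
     * struct, names and send function are invented here. */
    #include <stdio.h>

    struct attr_update {
        const char *name;
        const char *value;
    };

    /* Stand-in for "one IPC message, one CIB write, one notification". */
    static void send_attr_batch(const struct attr_update *updates, int n)
    {
        printf("one request carrying %d attribute(s):\n", n);
        for (int i = 0; i < n; i++) {
            printf("  %s=%s\n", updates[i].name, updates[i].value);
        }
    }

    int main(void)
    {
        /* Attribute names shown only as examples of the fail-count /
         * last-failure pair discussed above. */
        struct attr_update batch[] = {
            { "fail-count-myrsc",   "1" },
            { "last-failure-myrsc", "1554474000" },
        };

        send_attr_batch(batch, 2);
        return 0;
    }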
> 
> There would also need to be some special handling for mixed-version
> clusters (i.e. keeping it working during a rolling upgrade when some
> nodes have the new capability and some don't).
> 
> It's certainly doable but a medium-sized project. If there are
> volunteers I can give some pointers :) but there's quite a backlog of
> projects at the moment.
> 
> > b) migration-threshold (and possibly other scores) should be properly
> >    parsed/converted/capped/scaled/rejected
> 
> That should already be happening
> 
> > What do you think?
> > 
> > "someone" probably finds the relevant lines of code faster than I
> > do
> > ;-)
> > 
> > Cheers,
> > 
> >     Lars
-- 
Ken Gaillot <kgaillot at redhat.com>


