[ClusterLabs Developers] strange migration-threshold overflow, and fail-count update aborting it's own recovery transition

Fri Apr 5 16:20:04 UTC 2019

On Fri, 2019-04-05 at 17:19 +0200, Lars Ellenberg wrote:
> On Fri, Apr 05, 2019 at 09:56:51AM -0500, Ken Gaillot wrote:
> > On Fri, 2019-04-05 at 09:44 -0500, Ken Gaillot wrote:
> > > On Fri, 2019-04-05 at 15:50 +0200, Lars Ellenberg wrote:
> > > > As mentioned in #clusterlabs,
> > > > but I think I post it here, so it won't get lost:
> > > > 
> > > > pacemaker 1.1.19, in case that matters.
> > > > 
> > > > "all good".
> > > > 
> > > > provoking resource monitoring failure
> > > > (manually umount of some file system)
> > > > 
> > > > monitoring failure triggers pengine run,
> > > > (input here is no fail-count in status section yet,
> > > > but failed monitoring operation)
> > > > results in "Recovery" transition.
> > 
> > Ah, I misunderstood ... I was thinking the two writes were the fail
> > count and the last failure time node attributes, but that's not it
> > (and
> > probably already handled sufficiently).
> > 
> > The two writes are the recording of the operation status result,
> > followed by the recording of the fail count and last failure time
> > node
> > attributes.
> > 
> > That's more challenging. The controller needs to see the status
> > result
> > to determine whether it was a failure or not, at which point it
> > sends
> > the fail count change to the attribute manager, which writes it to
> > the
> > CIB, and the controller can only act on it then.
> > 
> > A possible solution would be to skip running the scheduler for the
> > status result if we're sending a fail count change. But that's
> > risky
> > because it assumes everything involved in the fail count change
> > will
> > succeed, and in a reasonable time. That's why the current approach
> > is
> > to run the scheduler immediately and then re-run when the fail-
> > count
> > change is confirmed.
> 
> I see.  So there is that.
> We can live with that, I guess,
> it is very unlikely to cause issues "in the real world".
> 
> > > > which is then aborted by the fail-count=1 update of the very
> > > > failure
> > > > that this recovery transition is about.
> > > > 
> > > > Meanwhile, the "stop" operation was already scheduled,
> > > > and results in "OK", so the second pengine run
> > > > now has as input a fail-count=1, and a stopped resource.
> > > > 
> > > > The second pengine run would usually come to the same result,
> > > > minus already completed actions, and no-one would notice.
> > > > I assume it has been like that for a long time?
> > > 
> > > Yep, that's how it's always worked
> > > 
> > > > But in this case, someone tried to be smart
> > > > and set a migration-threshold of "very large",
> > > > in this case the string in xml was: 999999999999, 
> > > > and that probably is "parsed" into some negative value,
> > > 
> > > Anything above "INFINITY" (actually 1,000,000) should be mapped
> > > to
> > > INFINITY. If that's not what happens, there's a bug. Running
> > > crm_simulate in verbose mode should be helpful.
> 
> I think I found it already.
> 
> char2score() does crm_parse_int(),
> and reasonably assumes that the result is the parsed int.
> Which it is not, if the result is -1, and errno is set to EINVAL or
> ERANGE ;-)

Ahhhhh :)

> 	 char2score -> crm_parse_int 
> 		 "999999999999"  -> result of strtoll is > INT_MAX,
> 		 result -1, errno ERANGE
> 	 migration_threshold = -1;
> 
> Not sure what to do there, though.
> Yet an other helper,
> mapping ERANGE to appropriate MIN/MAX for the conversion?

Yes, but we don't need a another helper ... it looks like the problem
is that crm_parse_int() maps ERANGE to -1. At that point, it has the
value from crm_parse_ll() which should be LONG_MIN or LONG_MAX, so it
should map it to INT_MIN or INT_MAX instead. I'll take care of that.

> But any "sane" configuration would not even trigger that.
> Where and how would we point out the "in-sane-ness" to the user,
> though?

crm_int_helper() will log an error if anything goes wrong, so I think
that's covered as far we can. Did you check to see if there was a
"Conversion ... was clipped" log message in your case? (We need the
above fix to actually clip it in this case instead of blowing up, but
at least there's a message.)

> > > > which means the fail-count=1 now results in "forcing away ...",
> > > > different resource placements,
> > > > and the file system placement elsewhere now results in much
> > > > more
> > > > actions, demoting/role changes/movement of other dependent
> > > > resources
> > > > ...
> > > > 
> > > > 
> > > > So I think we have two issues here:
> > > > 
> > > > a) I think the fail-count update should be visible as input in
> > > > the
> > > > cib
> > > >    before the pengine calculates the recovery transition.
> > > 
> > > Agreed, but that's a nontrivial change to the current design.
> > > 
> > > Currently, once the controller detects a failure, it sends a
> > > request
> > > to the attribute manager to update the fail-count attribute, then
> > > immediately sends another request to update the last-failure
> > > time.
> > > The attribute manager writes these out to the CIB as it gets
> > > them,
> > > the CIB notifies the controller as each write completes, and the
> > > controller calls the scheduler when it receives each
> > > notification.
> > > 
> > > The ideal solution would be to have the controller send a single
> > > request to the attrd with both attributes, and have attrd write
> > > the
> > > values together. However the attribute manager's current
> > > client/server protocol and inner workings are designed for one
> > > attribute at a time.  Writes can be grouped opportunistically or
> > > when a dampening is explicitly configured, but the main approach
> > > is
> > > to write each change as it is received.
> > > 
> > > There would also need to be some special handling for mixed-
> > > version
> > > clusters (i.e. keeping it working during a rolling upgrade when
> > > some
> > > nodes have the new capability and some don't).
> > > 
> > > It's certainly doable but a medium-sized project. If there are
> > > volunteers I can give some pointers :) but there's quite a
> > > backlog of
> > > projects at the moment.
> > > 
> > > > b) migration-theshold (and possibly other scores) should be
> > > > properly
> > > >    parsed/converted/capped/scaled/rejected
> > > 
> > > That should already be happening
> 
> 
-- 
Ken Gaillot <kgaillot at redhat.com>