[ClusterLabs Developers] strange migration-threshold overflow, and fail-count update aborting its own recovery transition

Lars Ellenberg lars.ellenberg at linbit.com
Fri Apr 5 11:19:30 EDT 2019


On Fri, Apr 05, 2019 at 09:56:51AM -0500, Ken Gaillot wrote:
> On Fri, 2019-04-05 at 09:44 -0500, Ken Gaillot wrote:
> > On Fri, 2019-04-05 at 15:50 +0200, Lars Ellenberg wrote:
> > > As mentioned in #clusterlabs,
> > > but I think I'll post it here, so it won't get lost:
> > > 
> > > pacemaker 1.1.19, in case that matters.
> > > 
> > > "all good".
> > > 
> > > Provoking a resource monitoring failure
> > > (manual umount of some file system).
> > > 
> > > The monitoring failure triggers a pengine run
> > > (input here: no fail-count in the status section yet,
> > > but a failed monitoring operation),
> > > which results in a "Recovery" transition.
> 
> Ah, I misunderstood ... I was thinking the two writes were the fail
> count and the last failure time node attributes, but that's not it (and
> probably already handled sufficiently).
> 
> The two writes are the recording of the operation status result,
> followed by the recording of the fail count and last failure time node
> attributes.
> 
> That's more challenging. The controller needs to see the status result
> to determine whether it was a failure or not, at which point it sends
> the fail count change to the attribute manager, which writes it to the
> CIB, and the controller can only act on it then.
> 
> A possible solution would be to skip running the scheduler for the
> status result if we're sending a fail count change. But that's risky
> because it assumes everything involved in the fail count change will
> succeed, and in a reasonable time. That's why the current approach is
> to run the scheduler immediately and then re-run when the fail-count
> change is confirmed.

I see.  So there is that.
We can live with that, I guess;
it is very unlikely to cause issues "in the real world".
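
To restate that sequence for myself as a tiny stand-alone model (the names
and structure below are my invention for illustration, not actual controller
internals): the first pengine run only has the failed op as input, the second
also has the fail count.

#include <stdbool.h>
#include <stdio.h>

struct cib_input {
    bool failed_op_recorded;   /* operation status result in the CIB */
    int  fail_count;           /* fail-count node attribute */
};

static void pengine_run(const struct cib_input *cib)
{
    printf("pengine input: failed-op=%s fail-count=%d\n",
           cib->failed_op_recorded ? "yes" : "no", cib->fail_count);
}

int main(void)
{
    struct cib_input cib = { false, 0 };

    /* 1. The failed monitor result is written to the CIB status section;
     *    the scheduler run on that input produces the "Recovery" transition. */
    cib.failed_op_recorded = true;
    pengine_run(&cib);

    /* 2. attrd's fail-count write is confirmed; the in-flight transition
     *    is aborted and the scheduler runs again with the extra input. */
    cib.fail_count = 1;
    pengine_run(&cib);

    return 0;
}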

> > > which is then aborted by the fail-count=1 update of the very
> > > failure
> > > that this recovery transition is about.
> > > 
> > > Meanwhile, the "stop" operation was already scheduled,
> > > and returned "OK", so the second pengine run
> > > now has as input a fail-count=1 and a stopped resource.
> > > 
> > > The second pengine run would usually come to the same result,
> > > minus already completed actions, and no-one would notice.
> > > I assume it has been like that for a long time?
> > 
> > Yep, that's how it's always worked
> > 
> > > But in this case, someone tried to be smart
> > > and set a migration-threshold of "very large";
> > > the string in the XML was 999999999999,
> > > and that probably gets "parsed" into some negative value,
> > 
> > Anything above "INFINITY" (actually 1,000,000) should be mapped to
> > INFINITY. If that's not what happens, there's a bug. Running
> > crm_simulate in verbose mode should be helpful.

I think I found it already.

char2score() does crm_parse_int(),
and reasonably assumes that the result is the parsed int.
Which it is not if the result is -1 and errno is set to EINVAL or
ERANGE ;-)

	char2score -> crm_parse_int
		"999999999999" -> result of strtoll is > INT_MAX,
		result -1, errno ERANGE
	migration_threshold = -1;

Not sure what to do there, though.
Yet another helper,
mapping ERANGE to the appropriate MIN/MAX for the conversion?

But any "sane" configuration would not even trigger that.
Where and how would we point out the "in-sane-ness" to the user, though?
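
For what it's worth, a minimal sketch of what such a clamping helper could
look like, as a stand-alone program (parse_score_clamped and the details are
my invention, not the actual crm_parse_int/char2score code):

#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

static int parse_score_clamped(const char *text, int dflt)
{
    char *end = NULL;
    long long value;

    if (text == NULL) {
        return dflt;
    }

    errno = 0;
    value = strtoll(text, &end, 10);

    if (end == text) {                  /* no digits at all: keep the default */
        return dflt;
    }
    if (errno == ERANGE) {              /* clamp instead of signalling -1 */
        return (value < 0) ? INT_MIN : INT_MAX;
    }
    if (value > INT_MAX) {
        return INT_MAX;
    }
    if (value < INT_MIN) {
        return INT_MIN;
    }
    return (int) value;
}

int main(void)
{
    /* "999999999999" overflows int; with clamping it becomes INT_MAX, which
     * char2score() could then cap at INFINITY (1,000,000) as Ken describes
     * above, instead of ending up as migration_threshold = -1. */
    printf("%d\n", parse_score_clamped("999999999999", 0));
    return 0;
}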

> > > which means the fail-count=1 now results in "forcing away ...",
> > > different resource placements,
> > > and placing the file system elsewhere now results in many more
> > > actions: demoting/role changes/movement of other dependent
> > > resources
> > > ...
> > > 
> > > 
> > > So I think we have two issues here:
> > > 
> > > a) I think the fail-count update should be visible as input in the
> > > CIB
> > >    before the pengine calculates the recovery transition.
> > 
> > Agreed, but that's a nontrivial change to the current design.
> > 
> > Currently, once the controller detects a failure, it sends a request
> > to the attribute manager to update the fail-count attribute, then
> > immediately sends another request to update the last-failure time.
> > The attribute manager writes these out to the CIB as it gets them,
> > the CIB notifies the controller as each write completes, and the
> > controller calls the scheduler when it receives each notification.
> > 
> > The ideal solution would be to have the controller send a single
> > request to the attrd with both attributes, and have attrd write the
> > values together. However, the attribute manager's current
> > client/server protocol and inner workings are designed for one
> > attribute at a time.  Writes can be grouped opportunistically or
> > when a dampening is explicitly configured, but the main approach is
> > to write each change as it is received.
> > 
> > There would also need to be some special handling for mixed-version
> > clusters (i.e. keeping it working during a rolling upgrade when some
> > nodes have the new capability and some don't).
> > 
> > It's certainly doable but a medium-sized project. If there are
> > volunteers I can give some pointers :) but there's quite a backlog of
> > projects at the moment.
> > 
> > > b) migration-threshold (and possibly other scores) should be
> > > properly
> > >    parsed/converted/capped/scaled/rejected
> > 
> > That should already be happening
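
Coming back to the "single request with both attributes" idea above, a rough
sketch of what that could look like from the client side, just to make it
concrete (the types, names, and attribute spellings here are illustrative,
not attrd's actual protocol):

#include <stdio.h>

struct attr_update {
    const char *name;
    const char *value;
};

/* Pretend client call that carries several attributes in one message,
 * so the server side could write them to the CIB as a single batch. */
static void attrd_update_batch(const struct attr_update *updates, int n)
{
    printf("one attrd request, %d attributes:\n", n);
    for (int i = 0; i < n; i++) {
        printf("  %s=%s\n", updates[i].name, updates[i].value);
    }
}

int main(void)
{
    /* Attribute names follow the 1.1.x fail-count/last-failure naming for
     * a resource called "myfs"; the values are just example data. */
    struct attr_update updates[] = {
        { "fail-count-myfs",   "1" },
        { "last-failure-myfs", "1554472200" },
    };

    attrd_update_batch(updates, 2);
    return 0;
}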

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT

