[ClusterLabs Developers] scores and infinity

Ken Gaillot kgaillot at redhat.com
Mon Feb 17 17:12:47 EST 2020


On Thu, 2020-02-13 at 15:11 +0100, Jehan-Guillaume de Rorthais wrote:
> On Wed, 12 Feb 2020 15:11:41 -0600
> Ken Gaillot <kgaillot at redhat.com> wrote:
> ...
> > > INT_MAX would set the working interval to ±2GB. Producing 2GB
> > > worth of data in a few seconds/minutes is possible, but
> > > considering the minimal XLOG record, this would push it to 48GB.
> > > Good enough I suppose.
> > > 
> > > INT64_MAX would set the working interval to...±8EB. Here, no
> > > matter what you choose as master score, you still have some
> > > safety :)
> > > 
> > > So you think this is something worth working on? Are there some
> > > traps along the way that forbid using INT_MAX or INT64_MAX?
> > > Should I try to build a PoC to discuss it?
> > 
> > I do think it makes sense to expand the range, but there are some
> > fuzzy concerns. One reason the current range is so small is that
> > it allows summing a large number of scores without worrying about
> > the possibility of integer overflow.
> 
> Interesting, makes sense.
> 
> > I'm not sure how important that is to the current code, but it's
> > something that would take a lot of tedious inspection of the
> > scheduler code to make sure it would be OK.
> 
> I suppose adding some regression tests for each bug reported is the
> policy?

Yes, but that has a fairly small coverage of the code. Scores are used
so extensively that we'd have to check everywhere they're used to make
sure they can handle the change.

> 
> > In principle a 64-bit range makes sense to me. I think "INFINITY"
> > should be slightly less than half the max so at least two scores
> > could be added without concern, and then we could try to ensure
> > that we never add more than 2 scores at a time (using a function
> > that checks for infinity).
> 
> A few weeks ago, Lars Ellenberg pointed me to merge_weight while
> discussing this same issue on #clusterlabs:
> 
> https://github.com/ClusterLabs/pacemaker/blob/master/lib/pengine/common.c#L397
> 
> I suppose it's a good first starting point.

Yep, that's what I had in mind.
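
To make the idea concrete -- this is only a rough sketch, not the
actual merge_weight code, and the names here are invented -- an
overflow-safe addition with a 64-bit "INFINITY" chosen below half of
INT64_MAX could look something like this:

#include <stdint.h>

/* Hypothetical 64-bit infinity, kept below INT64_MAX / 2 so that the
 * sum of any two in-range scores cannot overflow a signed 64-bit int
 * (today's INFINITY is only 1000000).
 */
#define SCORE64_INFINITY (INT64_MAX / 2 - 1)

/* Add exactly two scores, clamping the result to +/-SCORE64_INFINITY.
 * How +INFINITY and -INFINITY should combine is a policy choice for
 * the real code; in this sketch, +INFINITY simply wins.
 */
static int64_t
score64_add(int64_t a, int64_t b)
{
    int64_t sum;

    if ((a >= SCORE64_INFINITY) || (b >= SCORE64_INFINITY)) {
        return SCORE64_INFINITY;
    }
    if ((a <= -SCORE64_INFINITY) || (b <= -SCORE64_INFINITY)) {
        return -SCORE64_INFINITY;
    }
    /* Both operands are now strictly inside (-INT64_MAX/2, INT64_MAX/2),
     * so the addition itself cannot overflow. */
    sum = a + b;
    if (sum >= SCORE64_INFINITY) {
        return SCORE64_INFINITY;
    }
    if (sum <= -SCORE64_INFINITY) {
        return -SCORE64_INFINITY;
    }
    return sum;
}

Anything that needs to sum more than two scores would have to go
through such a helper one step at a time, which is the "never add more
than 2 scores at a time" rule mentioned above.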

> > Alternatively, if we come up with a code object for "score" that
> > has a 64-bit int and a separate bit flag for infinity, we could
> > use the full range.
> 
> A bit less than half of 2^63 is already 2EB and the arithmetic stays
> simple. But a score object wouldn't be too difficult either. I would
> give the former a try first, then extend to the latter if it feels
> close enough, as the hard part would already be done.
> 
> > Unfortunately, any change in the score will break backward
> > compatibility in the public C API, so it will have to be done when
> > we are ready to release a bunch of such changes.
> 
> I'm not familiar with this C API. Any pointer?

It's (somewhat) documented at:
https://clusterlabs.org/pacemaker/doxygen/

The core and scheduler APIs might be the only ones affected. Examples
are the pe_node_t type, which currently has an "int weight" member,
and pe_resource_t with "int stickiness". There are some public
functions that use an int score too, such as char2score() and
score2char().

Not that anyone actually uses the C API, but since we do make it
available, we have to be careful with backward compatibility.
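
For the alternative above -- a score object with a separate infinity
flag -- a very rough sketch (the type and function names are invented
here, not a proposal for the actual API) could be:

#include <stdint.h>

/* Hypothetical score type: the full 64-bit range becomes usable
 * because "infinite" is tracked out of band rather than as a magic
 * numeric value.
 */
typedef struct {
    int64_t value;     /* meaningful only when infinite == 0 */
    int     infinite;  /* 0 = finite, +1 = +INFINITY, -1 = -INFINITY */
} score_t;

/* Add two scores, saturating to +/-INFINITY instead of overflowing. */
static score_t
score_add(score_t a, score_t b)
{
    score_t result = { 0, 0 };

    if ((a.infinite != 0) || (b.infinite != 0)) {
        /* Again, how the two infinities combine is a policy choice;
         * in this sketch, -INFINITY wins. */
        result.infinite = ((a.infinite < 0) || (b.infinite < 0)) ? -1 : 1;
        return result;
    }
    if ((b.value > 0) && (a.value > INT64_MAX - b.value)) {
        result.infinite = 1;                  /* would overflow up */
    } else if ((b.value < 0) && (a.value < INT64_MIN - b.value)) {
        result.infinite = -1;                 /* would overflow down */
    } else {
        result.value = a.value + b.value;
    }
    return result;
}

In this variant, members such as pe_node_t's weight would become a
score_t rather than an int, which is exactly the kind of public-API
change described above.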

> > It would likely be a "2.1.0" release, and probably not until 1-2
> > years from now. At least that gives us time to investigate and
> > come up with a design.
> 
> Sure, I'm not in a rush here anyway; that mail had been in my drafts
> for...6 or 12 months maybe? :)
> 
> > > Besides this master score limit, we suffer from these other
> > > constraints:
> > > 
> > > * attrd_updater is highly asynchronous:
> > >   * values are not yet available locally when the command exits
> > >   * ...nor are they available from remote nodes
> > >   * we had to wrap it in a loop that waits for the change to
> > >     become available locally.
> > 
> > There's an RFE to offer a synchronous option to attrd_updater --
> > which you knew already since you submitted it :) but I'll mention
> > it in case anyone else wants to follow it:
> > 
> > https://bugs.clusterlabs.org/show_bug.cgi?id=5347
> 
> I forgot about this one :)
> I found a way to put a bandaid on this in PAF anyway.
> 
> > It is definitely a goal; the question is always just developer
> > time.
> 
> sure
> 
> > > * notification actions return code are ignored
> 
> [..this is discussed in another thread..]
> 
> > > * OCF_RESKEY_CRM_meta_notify_* are available (officially) only
> > >   during notification actions
> > 
> > That's a good question, whether start/stop should be guaranteed to
> > have them as well.
> 
> and promote/demote.
> 
> > One question would be whether to use the pre- or post-values.
> 
> I would vote for pre-. When the action is called for e.g. a start,
> the resource is not started yet, so it should still appear in e.g.
> "inactive".
> 
> > Not directly related, but in the same vein, Andrew Beekhof proposed
> > a new promotable clone type, where promotion scores are discovered
> > ahead of time rather than after starting instances in slave mode.
> > The idea would be to have a new "discover" action in resource
> > agents that would output the master score (and would be called
> > before starting any instances),
> 
> This is an appealing idea. I'm not sure why a new type of promotable
> clone would be required though. Adding this new operation to the
> existing OCF specs for promotable clones would be enough, wouldn't
> it? As long as the RA exposes this operation in its meta-data,
> PEngine can decide to use it whenever it needs to find some clone to
> promote -- not just before starting the resource, e.g. even after a
> primary loss when all secondaries are already up.

Makes sense

> This would greatly help to keep the RA code clean and simple, with
> very little cluster-related logic.
> 
> I love the idea :)
> 
> > and then on one instance selected to be promoted, another new
> > action (like "bootstrap") would be called to do some initial
> > start-up that all instances need, before the cluster started all
> > the other instances normally (whether start or start+promote for
> > multi-master). That would be a large effort -- note Beekhof was
> > not volunteering to do it. :)
> 
> Not convinced about this one; I'm not sure how useful it would be
> for my very limited use case though. If the idea is to e.g.
> provision secondaries, I think this is far from the responsibility
> of the RA or the cluster itself. I'm not even sure some kind of e.g.
> failback action would do. But maybe I misunderstood the idea.

I forget the particular service where it would be helpful. I'll have to
ask Beekhof again.

> 
> Regards,
-- 
Ken Gaillot <kgaillot at redhat.com>


