[ClusterLabs Developers] scores and infinity

Jehan-Guillaume de Rorthais jgdr at dalibo.com
Thu Feb 13 09:11:48 EST 2020

On Wed, 12 Feb 2020 15:11:41 -0600
Ken Gaillot <kgaillot at redhat.com> wrote:
> > INT_MAX would set the working interval to ±2GB. Producing 2GB worth
> > of data
> > in a few seconds/minutes is possible, but considering the minimal XLOG
> > record, this would push it to 48GB. Good enough I suppose.
> > 
> > INT64_MAX would set the working interval to...±8EB. Here, no matter
> > what you
> > choose as the master score, you still have some safety :)
> > 
> > So you think this is something worth working on? Are there any traps
> > along the way
> > that would forbid using INT_MAX or INT64_MAX? Should I try to build a
> > PoC to discuss
> > it?
> I do think it makes sense to expand the range, but there are some fuzzy
> concerns. One reason the current range is so small is that it allows
> summing a large number of scores without worrying about the possibility
> of integer overflow.

Interesting, that makes sense.

> I'm not sure how important that is to the current code but it's something
> that would take a lot of tedious inspection of the scheduler code to make
> sure it would be OK.

I suppose adding some regression tests for each bug reported is the way to go.

> In principle a 64-bit range makes sense to me. I think "INFINITY"
> should be slightly less than half the max so at least two scores could
> be added without concern, and then we could try to ensure that we never
> add more than 2 scores at a time (using a function that checks for
> infinity).

A few weeks ago, Lars Ellenberg pointed me to merge_weight while discussing this
same issue on #clusterlabs:


I suppose it's a good first starting point.
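To make the "INFINITY at a bit less than half the max, never add more than two scores at a time" idea concrete, here is a minimal C sketch. All names (SCORE_INFINITY, clamp_score, add_scores) are illustrative only, not Pacemaker's actual API; the real logic lives around merge_weight. It preserves the convention that -INFINITY wins over +INFINITY:

```c
#include <stdint.h>

/* Hypothetical: with INFINITY a bit below INT64_MAX / 2, any two
 * clamped scores can be added without 64-bit overflow. */
#define SCORE_INFINITY (INT64_MAX / 2 - 1)

static int64_t clamp_score(int64_t s) {
    if (s >= SCORE_INFINITY)  return SCORE_INFINITY;
    if (s <= -SCORE_INFINITY) return -SCORE_INFINITY;
    return s;
}

/* Add exactly two scores. Infinity is "sticky", and -INFINITY
 * beats +INFINITY, as in Pacemaker's scoring convention. */
static int64_t add_scores(int64_t a, int64_t b) {
    a = clamp_score(a);
    b = clamp_score(b);
    if (a == -SCORE_INFINITY || b == -SCORE_INFINITY)
        return -SCORE_INFINITY;
    if (a == SCORE_INFINITY || b == SCORE_INFINITY)
        return SCORE_INFINITY;
    /* both |a| and |b| are < INT64_MAX / 2, so a + b cannot overflow;
     * clamp in case the finite sum reaches the infinity threshold */
    return clamp_score(a + b);
}
```

Summing more than two scores would then be a chain of such two-score additions, each one going back through the infinity checks.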

> Alternatively if we come up with a code object for "score"
> that has a 64-bit int and a separate bit flag for infinity, we could
> use the full range.

A bit less than half of 2^63 is already 2EB, and the arithmetic stays simple.
But a score object wouldn't be too difficult either. I would give the former a
try first, then extend to the latter if needed, as the hard part would already
be done.
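For comparison, the score-object alternative could look something like this minimal sketch. Again, score_t and score_add are hypothetical names, nothing from Pacemaker, and __builtin_add_overflow assumes GCC or Clang:

```c
#include <stdint.h>

/* Hypothetical "score" object: full 64-bit range for finite values,
 * plus a separate flag for +/- infinity. */
typedef struct {
    int64_t value;    /* meaningful only when infinite == 0 */
    int     infinite; /* 0 = finite, +1 = +INFINITY, -1 = -INFINITY */
} score_t;

static score_t score_add(score_t a, score_t b) {
    /* -INFINITY wins over +INFINITY, as in Pacemaker's convention */
    if (a.infinite < 0 || b.infinite < 0)
        return (score_t){0, -1};
    if (a.infinite > 0 || b.infinite > 0)
        return (score_t){0, +1};
    /* finite + finite: detect 64-bit overflow (GCC/Clang builtin)
     * and saturate to the matching infinity */
    int64_t sum;
    if (__builtin_add_overflow(a.value, b.value, &sum))
        return (score_t){0, (a.value > 0) ? +1 : -1};
    return (score_t){sum, 0};
}
```

The upside is the full ±2^63 finite range and no magic threshold; the downside is that every comparison and arithmetic site in the scheduler would have to go through such helpers instead of plain int operations.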

> Unfortunately any change in the score will break backward compatibility
> in the public C API, so it will have to be done when we are ready to
> release a bunch of such changes.

I'm not familiar with this C API. Any pointers?

> It would likely be a "2.1.0" release, and probably not until 1-2 years from
> now. At least that gives us time to investigate and come up with a design.

Sure, I'm not in a rush here anyway; this mail had been sitting in my drafts
for...6 or 12 months maybe? :)

> > Beside this master score limit, we suffer from these other
> > constraints:
> > 
> > * attrd_updater is highly asynchronous:
> >   * values are not yet available locally when the command exits
> >   * ...nor are they available from remote nodes
> >   * we had to wrap it in a loop that waits for the change to become
> >     available locally.
> There's an RFE to offer a synchronous option to attrd_updater -- which
> you knew already since you submitted it :) but I'll mention it in case
> anyone else wants to follow it:
> https://bugs.clusterlabs.org/show_bug.cgi?id=5347

I forgot about this one :)
I found a way to put a band-aid on this in PAF anyway.

> It is definitely a goal, the question is always just developer time.


> > * notification actions return code are ignored

[..this is discussed in another thread..]

> > * OCF_RESKEY_CRM_meta_notify_* are available (officially) only during
> >   notification action
> That's a good question, whether the start/stop should be guaranteed to
> have it as well.

and promote/demote.

> One question would be whether to use the pre- or post-values.

I would vote for pre-. When the action is called for, e.g., a start, the
resource is not started yet, so it should still appear in, e.g., "inactive".

> Not directly related, but in the same vein, Andrew Beekhof proposed a
> new promotable clone type, where promotion scores are discovered ahead
> of time rather than after starting instances in slave mode. The idea
> would be to have a new "discover" action in resource agents that would
> output the master score (which would be called before starting any
> instances),

This is an appealing idea. I'm not sure why a new type of promotable clone
would be required, though. Adding this new operation to the existing OCF specs
for promotable clones would be enough, wouldn't it? As long as the RA exposes
this operation in its meta-data, the PEngine can decide to use it whenever it
needs to find a clone to promote. Not just before starting the resource:
even after a primary loss, for example, when all secondaries are already up.

This would greatly help keep the RA code clean and simple, with very little
cluster-related logic.

I love the idea :)

> and then on one instance selected to be promoted, another
> new action (like "bootstrap") would be called to do some initial start-
> up that all instances need, before the cluster started all the other
> instances normally (whether start or start+promote for multi-master).
> That would be a large effort -- note Beekhof was not volunteering to do
> it. :)

I'm not convinced about this one, though I'm not sure how useful it would be
beyond my own very limited use case. If the idea is to, e.g., provision
secondaries, I think this is far outside the responsibility of the RA or the
cluster itself. I'm not even sure some kind of, e.g., failback action would
do. But maybe I misunderstood the idea.

