[ClusterLabs Developers] scores and infinity

Ken Gaillot kgaillot at redhat.com
Wed Feb 12 16:11:41 EST 2020

On Wed, 2020-02-12 at 15:39 +0100, Jehan-Guillaume de Rorthais wrote:
> Hi,
> As the PAF RA maintainer, I would like to discuss (sorry, again)
> something
> really painful: master scores and infinity.
> PAF is a RA for PostgreSQL. The best known value to pick a master is
> PostgreSQL's LSN (Log Sequence Number), which is a 64-bit incremental
> counter.
> LSN is related to the volume of data written to the databases since
> the
> instance creation.
> Each instance in the cluster (promoted or standby) reports its own
> LSN: 
> * the promoted reports its last written LSN
> * standbys report the last LSN they received
> That's why LSN is the natural "master score" when there is no
> promoted clone
> around. Therefore, the lag of a standby is measured in bytes, based
> on this LSN.
> Pacemaker master scores must fit between -1000000 and 1000000.
> Mapping this
> to LSN is impossible. Even if we can gather LSN diff between
> standbys (which
> would require a shared variable somewhere), this would be too small.
> 1000000 is
> only 1MB worth of lag. If we consider the minimal size of records in
> this log
> sequence number, we could stretch this to 24MB, but it's still way
> too small compared to e.g. a network-bound workload, where a standby
> can lag by far more than a few MB.
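To make the resolution problem concrete, here is a minimal sketch (helper name invented for illustration; not part of PAF or Pacemaker) mapping a byte lag derived from two LSNs onto the current ±1000000 score range:

```c
#include <stdint.h>

/* Pacemaker's current score bound ("INFINITY"). */
#define SCORE_MAX 1000000L

/* Hypothetical helper: derive a master score from the replication lag
 * (in bytes) between a reference LSN and a standby's last-received LSN.
 * Any lag of 1MB or more collapses to the same minimal score, which is
 * why the current range cannot discriminate between lagging standbys. */
static long lag_to_score(uint64_t ref_lsn, uint64_t standby_lsn)
{
    uint64_t lag = ref_lsn - standby_lsn;   /* bytes of missing WAL */

    if (lag >= (uint64_t) SCORE_MAX) {
        return 0;               /* saturated: all "far" standbys tie */
    }
    return SCORE_MAX - (long) lag;  /* closer standby => higher score */
}
```

With this mapping, a standby lagging by 2MB and one lagging by 2GB receive the same score, even though the first is clearly the better candidate.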
> Because of this, we use (and abuse for other purposes) notifications
> to elect
> the best standby:
> 0.   Pacemaker decides to promote one clone
> 1.1. during pre-promote, every clone sets its LSN as a private
>      attribute
> 1.2. the clone-to-promote tracks which clones take part in the
>      election in a private attribute
> 2.   during the promotion, the clone-to-promote compares its LSN with
>      the one set in 1.1 for each clone tracked in 1.2
> 3.   if one clone's LSN is greater than the local LSN:
> 3.1. it sets a greater master score for the best candidate
> 3.2. it returns an error
> 3.3. Pacemaker loops back to 0
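The comparison in steps 2-3 can be sketched as a standalone function (names invented; the real agent exchanges these values through private node attributes and notify actions):

```c
#include <stddef.h>
#include <stdint.h>

/* Steps 2-3 of the election: the clone chosen for promotion compares
 * its own LSN against the LSNs published by the other clones in 1.1.
 * Returns -1 if the local clone is the best candidate (promotion can
 * proceed), otherwise the index of the clone that should be promoted
 * instead -- the agent would then raise that clone's master score and
 * return an error so Pacemaker loops back to step 0. */
static int pick_better_candidate(uint64_t local_lsn,
                                 const uint64_t *peer_lsn, size_t n_peers)
{
    int best = -1;
    uint64_t best_lsn = local_lsn;

    for (size_t i = 0; i < n_peers; i++) {
        if (peer_lsn[i] > best_lsn) {
            best_lsn = peer_lsn[i];
            best = (int) i;
        }
    }
    return best;
}
```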
> Higher bounds for ±INF would help a lot to make this simpler. After
> the primary
> is confirmed dead, all standbys might just update how far they are
> from the
> latest checkpoint published by the master a few seconds or minutes ago.
> INT_MAX would set the working interval to ±2GB. Producing 2GB worth
> of data
> in a few seconds/minutes is possible, but considering the minimal XLOG
> record, this would push to 48GB. Good enough I suppose.
> INT64_MAX would set the working interval to... ±8EB. Here, no matter
> what you
> choose as a master score, you still have some safety :)
> So, do you think this is something worth working on? Are there traps
> along the way
> that forbid using INT_MAX or INT64_MAX? Should I try to build a PoC
> to discuss
> it?

I do think it makes sense to expand the range, but there are some fuzzy
concerns. One reason the current range is so small is that it allows
summing a large number of scores without worrying about the possibility
of integer overflow. I'm not sure how important that is to the current
code but it's something that would take a lot of tedious inspection of
the scheduler code to make sure it would be OK.

In principle a 64-bit range makes sense to me. I think "INFINITY"
should be slightly less than half the max so at least two scores could
be added without concern, and then we could try to ensure that we never
add more than 2 scores at a time (using a function that checks for
infinity). Alternatively if we come up with a code object for "score"
that has a 64-bit int and a separate bit flag for infinity, we could
use the full range.
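A rough sketch of that second alternative (all names invented here, not an actual Pacemaker API): a score type carrying a 64-bit value plus an out-of-band infinity flag, with an addition helper that saturates instead of overflowing:

```c
#include <stdint.h>

/* Hypothetical score object: full 64-bit range for finite values,
 * with +/- infinity kept out of band in a flag. */
typedef struct {
    int64_t value;   /* meaningful only when inf == 0 */
    int inf;         /* -1 = -INFINITY, 0 = finite, +1 = +INFINITY */
} score_t;

/* Add two scores. Infinity absorbs any finite value, and a finite sum
 * that would overflow int64_t saturates to the matching infinity, so
 * callers never need to range-check before adding. Following
 * Pacemaker's existing convention, -INFINITY + INFINITY = -INFINITY
 * ("must not" beats "must"). */
static score_t score_add(score_t a, score_t b)
{
    score_t r = {0, 0};

    if (a.inf == -1 || b.inf == -1) {
        r.inf = -1;                     /* -INFINITY wins */
    } else if (a.inf == 1 || b.inf == 1) {
        r.inf = 1;
    } else if (a.value > 0 && b.value > INT64_MAX - a.value) {
        r.inf = 1;                      /* saturate on overflow */
    } else if (a.value < 0 && b.value < INT64_MIN - a.value) {
        r.inf = -1;                     /* saturate on underflow */
    } else {
        r.value = a.value + b.value;
    }
    return r;
}
```

The flag costs a struct instead of a bare int, but it removes the "INFINITY must be well below the type's max" constraint entirely.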

Unfortunately any change in the score will break backward compatibility
in the public C API, so it will have to be done when we are ready to
release a bunch of such changes. It would likely be a "2.1.0" release,
and probably not until 1-2 years from now. At least that gives us time
to investigate and come up with a design.

> Beside this master score limit, we suffer from these other
> constraints:
> * attrd_updater is highly asynchronous:
>   * values are not yet available locally when the command exits
>   * ...nor are they available from remote nodes
>   * we had to wrap it in a loop that waits for the change to become
> available
>     locally.
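That wrapper loop amounts to a bounded poll, roughly like this (helper and stub names are invented; the real wrapper shells out to attrd_updater to write and re-query the attribute):

```c
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical: re-check until the expected attribute value is visible
 * locally, or give up after max_tries. query() stands in for whatever
 * reads the attribute back (e.g. an attrd_updater query). */
static bool wait_for_attr(bool (*query)(void *ctx), void *ctx,
                          int max_tries, unsigned delay_s)
{
    for (int i = 0; i < max_tries; i++) {
        if (query(ctx)) {
            return true;         /* value became visible */
        }
        sleep(delay_s);          /* asynchronous write not landed yet */
    }
    return false;                /* caller must treat this as an error */
}

/* Demo predicate: becomes true on the third query, simulating the
 * attribute propagating after two polls. */
static int demo_calls;
static bool demo_query(void *ctx)
{
    (void) ctx;
    return ++demo_calls >= 3;
}
```

A synchronous option in attrd_updater itself would make this kind of polling unnecessary.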

There's an RFE to offer a synchronous option to attrd_updater -- which
you knew already since you submitted it :) but I'll mention it in case
anyone else wants to follow it:


It is definitely a goal, the question is always just developer time.

> * notification actions return code are ignored

It might be useful to support "on-fail" for the notify operation, and
default to "ignore" to preserve current behavior.

However the notify action is unique since it is associated with some
other action. Would a single "on-fail" for all notifications be enough,
or would we need some way to set different "on-fail" values for
pre/post and start/stop notifications?

If we did support on-fail for notify, a default on-fail set in
op_defaults would begin to apply to notify operations as well. That
might be unexpected, especially for configurations that have long
worked as-is but would start causing problems in this case. I would
want to wait at least until a minor version bump (2.1.0), or maybe even
a major bump (3.0.0), though we could potentially make it available as
a compile-time option in the meantime.

Feel free to open an RFE.

> * OCF_RESKEY_CRM_meta_notify_* are available (officially) only during
>   notification action

That's a good question, whether the start/stop actions should be
guaranteed to have it as well. One question would be whether to use the
pre- or post-notification values.

Not directly related, but in the same vein, Andrew Beekhof proposed a
new promotable clone type, where promotion scores are discovered ahead
of time rather than after starting instances in slave mode. The idea
would be to have a new "discover" action in resource agents that would
output the master score (which would be called before starting any
instances), and then on one instance selected to be promoted, another
new action (like "bootstrap") would be called to do some initial start-
up that all instances need, before the cluster started all the other
instances normally (whether start or start+promote for multi-master).
That would be a large effort -- note Beekhof was not volunteering to do
it. :)

> These three points should probably be discussed in dedicated threads,
> though.
> Regards,
Ken Gaillot <kgaillot at redhat.com>
