[ClusterLabs Developers] scores and infinity

Wed Feb 12 14:39:21 UTC 2020

Hi,

As the PAF RA maintainer, I would like to discuss (sorry, again) something
really painful: master scores and infinity.

PAF is an RA for PostgreSQL. The best known value for picking a master is
PostgreSQL's LSN (Log Sequence Number), which is a 64-bit incrementing counter.
The LSN is related to the volume of data written to the databases since the
instance was created.

Each instance in the cluster (promoted or standby) reports its own LSN:
* the promoted instance reports its last written LSN
* standbys report the last LSN they received

That's why the LSN is the natural "master score" when there is no promoted
clone around. It also means the lag of a standby is measured in bytes, as a
difference between LSNs.
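
To make the "lag in bytes" idea concrete, here is a minimal sketch that turns
the textual LSN form PostgreSQL prints (two hexadecimal numbers separated by a
slash, high 32 bits first) into a 64-bit integer and subtracts two of them.
The function names and the sample values are only illustrative; this is not
code taken from PAF.

```python
def parse_lsn(text: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(reference_lsn: str, standby_lsn: str) -> int:
    """Number of bytes the standby is behind the reference position."""
    return parse_lsn(reference_lsn) - parse_lsn(standby_lsn)


# Made-up positions: last LSN written by the primary, last LSN received
# by a standby.
print(lag_bytes("16/B374D848", "16/B3000000"))
```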

Pacemaker master scores must fit between -1000000 and 1000000. Mapping this
to LSN is impossible. Even if we could gather the LSN difference between
standbys (which would require a shared variable somewhere), the range would be
too small: 1000000 is only 1MB worth of lag. If we count in units of the
minimal record size in this log (24 bytes) rather than in bytes, we could
stretch this to 24MB, but that is still far too small for e.g. network-bound
workloads, where a standby can lag by much more than a few MB.
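
A small sketch of why the current range is too narrow. The mapping below
(score = upper bound minus lag, floored at 0) is purely hypothetical and only
serves to show the saturation; `unit=24` corresponds to counting in minimal
records instead of bytes.

```python
SCORE_MAX = 1_000_000  # Pacemaker's current upper bound for a score


def lag_to_score(lag_bytes: int, unit: int = 1, score_max: int = SCORE_MAX) -> int:
    """Hypothetical mapping: smaller lag gives a higher score, floored at 0."""
    return max(0, score_max - lag_bytes // unit)


# Lag values in bytes: none, 500kB, 5MB, 500MB.
for lag in (0, 500_000, 5_000_000, 500_000_000):
    print(lag, lag_to_score(lag), lag_to_score(lag, unit=24))
```

With byte granularity, the 5MB and the 500MB standbys end up with the same
score, so Pacemaker cannot tell them apart. Counting in 24-byte units
separates them, but anything lagging by more than 24MB collapses again.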

Because of this, we use (and abuse for other purposes) notifications to elect
the best standby:

0.   Pacemaker decides to promote one clone
1.1. during pre-promote, every clone sets its LSN as a private attribute
1.2. the clone-to-promote tracks which clones take part in the election in a
     private attribute
2.   during the promotion, the clone-to-promote compares its LSN with the LSN
     set in 1.1 by each clone tracked in 1.2
3.   if one clone's LSN is greater than the local LSN:
3.1. the clone-to-promote sets a greater master score for the best candidate
3.2. the promotion returns an error
3.3. Pacemaker loops to 0
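
The steps above can be sketched as a small in-memory simulation. Everything
here is a stand-in: plain dicts replace the private attributes and the master
scores, and picking the highest score replaces Pacemaker's decision. It only
illustrates the control flow, not the actual PAF code.

```python
def elect(lsns: dict, scores: dict) -> str:
    """Simulate the promote / fail / retry loop; return the promoted node."""
    while True:
        # 0. Pacemaker picks the clone with the highest master score.
        candidate = max(scores, key=scores.get)

        # 1.1 Every clone publishes its LSN during pre-promote.
        published = dict(lsns)
        # 1.2 The candidate records which clones take part in the election.
        participants = list(published)

        # 2. The candidate compares its own LSN with every participant's.
        best = max(participants, key=published.get)

        # 3. Another clone is ahead of the candidate.
        if published[best] > published[candidate]:
            # 3.1 Give the best candidate a greater master score.
            scores[best] = scores[candidate] + 1
            # 3.2 / 3.3 The promotion fails and Pacemaker starts over.
            continue

        return candidate


lsns = {"srv1": 100, "srv2": 300, "srv3": 200}
scores = {"srv1": 1001, "srv2": 1000, "srv3": 1000}
print(elect(lsns, scores))  # srv2, after one failed promotion of srv1
```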

Higher bounds for ±INF would help a lot to make this simpler. After the primary
is confirmed dead, all standbys could simply set their score according to how
far they are from the latest checkpoint published by the master a few seconds
or minutes earlier.

INT_MAX would set the working interval to ±2GB. Producing 2GB worth of data
in a few seconds or minutes is possible, but counting in minimal XLOG
records, this would push the interval to ±48GB. Good enough I suppose.

INT64_MAX would set the working interval to... ±8EB. Here, no matter what you
choose as master score, you still have some safety :)
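
For reference, the arithmetic behind these figures can be checked with a few
lines. The factor of 24 is the ratio implied by the 1MB and 24MB figures
above; the `human` helper is just an illustrative formatter using binary
(1024-based) units.

```python
MIN_RECORD_BYTES = 24  # ratio implied by the 1MB -> 24MB figures above

BOUNDS = {
    "current": 1_000_000,
    "INT_MAX": 2**31 - 1,
    "INT64_MAX": 2**63 - 1,
}


def human(n: int) -> str:
    """Format a byte count with binary (1024-based) units."""
    units = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]
    value = float(n)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f}{unit}"
        value /= 1024


for name, bound in BOUNDS.items():
    in_bytes = human(bound)
    in_records = human(bound * MIN_RECORD_BYTES)
    print(f"{name:>9}: ±{in_bytes} in bytes, ±{in_records} in minimal records")
```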

Do you think this is something worth working on? Are there any traps along the
way that forbid using INT_MAX or INT64_MAX? Should I try to build a PoC to
discuss it?

Besides this master score limit, we suffer from these other constraints:

* attrd_updater is highly asynchronous:
  * values are not yet available locally when the command exits
  * ...nor are they available from remote nodes
  * we had to wrap it in a loop that waits for the change to become available
    locally
* return codes of notification actions are ignored
* OCF_RESKEY_CRM_meta_notify_* are available (officially) only during
  notification actions
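
The wait loop mentioned in the first point looks roughly like the sketch
below. To keep it self-contained, the write and read steps are passed in as
callables; in a real RA they would shell out to attrd_updater. `LaggyStore` is
a made-up stand-in for an asynchronous attribute store, and the timeout and
interval values are arbitrary.

```python
import time


def set_and_wait(write, read, value, timeout=10.0, interval=0.1):
    """Write a value, then poll until a local read returns it.

    Returns True once the value is visible, False if the timeout expires.
    """
    write(value)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if read() == value:
            return True
        time.sleep(interval)
    return False


class LaggyStore:
    """Hypothetical store whose writes only show up after a few reads."""

    def __init__(self, delay_reads):
        self.delay_reads = delay_reads
        self.remaining = 0
        self.pending = None
        self.visible = None

    def write(self, value):
        self.pending = value
        self.remaining = self.delay_reads

    def read(self):
        if self.remaining > 0:
            self.remaining -= 1
        else:
            self.visible = self.pending
        return self.visible


store = LaggyStore(delay_reads=3)
print(set_and_wait(store.write, store.read, "16/B374D848", timeout=2, interval=0.01))
```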

These three points should probably be discussed in dedicated threads, though.

Regards,