[ClusterLabs Developers] problem with master score limited to 1000000

Mon Apr 27 06:47:56 EDT 2015

On Mon, Apr 27, 2015 at 10:56:58AM +0200, Jehan-Guillaume de Rorthais wrote:
> Hi Andrew,
> 
> On Mon, 27 Apr 2015 07:06:36 +1000
> Andrew Beekhof <andrew at beekhof.net> wrote:
> 
> > > On 25 Apr 2015, at 1:33 am, Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
> > > wrote:
> > > 
> > > We are writing a new resource agent for PostgreSQL (I am open to discuss why
> > > offlist to keep the thread clean) and are experiencing some limitations
> > > regarding master scoring in Pacemaker.
> > > 
> > > The only way in PostgreSQL to define which node should be promoted is to
> > > compare their location in their transaction log (called LSN). This LSN is
> > > expressed as a size that is obviously growing quickly.
> > 
> > We can look at bumping infinity, but what value would be acceptable?
> 
> I suppose most platforms support a value of 2^31-1 (~2 billion) as a simple
> 4-byte signed integer. But I can see two issues with this:
> 
>   * could it break the compatibility with other RA expecting "inf" to be
>     1,000,000?
>   * it just moves the limit farther, but it doesn't solve the real problem.
> 
> 
> In our case, 2GB would probably be enough in most situations, but consider
> this scenario:
> 
>   * monitor interval is 10 sec
>   * a table of 10GB is created on the master and streamed asynchronously to the
>     slaves
>   * the master crashes
> 
> If at least 2GB has been streamed to the slaves, they will all have the same
> "inf" value.
> 
> > Would using "seconds since X" be an option instead?
> 
> I don't understand what you mean. Does it apply to my problem or to the "inf"
> consideration? Could you elaborate?

There is no reason to use your LSN as master score directly.
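
To illustrate the saturation described in the quoted scenario, here is a
minimal sketch. It assumes scores are clamped at 1000000, the limit named in
the subject of this thread; clamp_score is a made-up helper for illustration,
not a Pacemaker function, and the byte counts are invented.

```python
# Hypothetical illustration; clamp_score is NOT part of Pacemaker.
SCORE_INFINITY = 1_000_000  # the limit from the subject of this thread


def clamp_score(value):
    """Clamp a raw value into [-SCORE_INFINITY, SCORE_INFINITY]."""
    return max(-SCORE_INFINITY, min(SCORE_INFINITY, value))


# Two slaves at clearly different positions (in bytes), both past the limit.
slave_a = 3 * 1024**3  # 3 GiB received (invented number)
slave_b = 7 * 1024**3  # 7 GiB received (invented number)

# Both collapse to the same score, so the better slave is indistinguishable.
print(clamp_score(slave_a) == clamp_score(slave_b))  # prints True
```

Raising the limit only moves the point at which this happens.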

If I understand correctly, with your proposal of using the constantly
changing LSN as master score directly, you hope that pacemaker will,
in the event of Master failure, always have valid current information
about which the best Slave would be, and failover to there.

You also know that that's not exactly true, because the information
may be stale for "now - last-monitoring-interval", so you try to
figure out some clever way to not wait for the next monitor interval
of all instances, but still base the decision on the information
that you would have had, if you did wait for it.

I think that does not work:
you cannot base decisions on information you don't have.
You either wait for the information (and possibly figure out a way to
request it on demand, or report it more frequently),
or you knowingly decide on incomplete information, and prepare to deal
with the consequences of a potentially "wrong-in-hindsight" decision.

I suggest that updating (changing!) the master score that frequently
would even hurt.

What I think you should do is update the LSN (or whatever value you want
to base the decision on) "frequently enough" -- whatever that means --
and potentially "on demand". You may consider NOT storing it in the CIB
directly, but maybe as a non-persistent attribute in attrd.

If you only store in attrd (and not the cib), you could update it much
more frequently, possibly by some "daemon" or trigger you start along
with the service.

You should start out without any master score (iirc, even master score
of 0 would allow promotion, only missing master score prevents pacemaker
from promoting).

N nodes, clone-max N, clone-node-max 1, master-max 1

During start, and monitor, you store the instance LSN in attrd.
If you see N instances started and all LSNs reported, and
if your LSN is (one of) the best LSN, set master score "7" (arbitrary).
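
A rough sketch of that start-up rule, just to make the decision concrete.
Everything here is an assumption for illustration: the function names, the
idea that the RA has already read the peers' LSNs back from attrd into a
dict, and returning None to mean "set no master score". The only thing taken
from PostgreSQL is the textual LSN form (two hex numbers separated by "/").

```python
# Hypothetical sketch; names and data layout are assumptions, not an RA API.
INITIAL_SCORE = 7  # arbitrary, as in the text


def parse_lsn(text):
    """Convert a PostgreSQL LSN such as '0/3000060' to a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def initial_master_score(me, reported, n_expected):
    """Return INITIAL_SCORE if `me` holds (one of) the best LSN, else None.

    `reported` maps node name -> LSN string, as read back from attrd.
    None means "do not set any master score".
    """
    if me not in reported or len(reported) < n_expected:
        return None  # not every instance has reported yet: keep waiting
    best = max(parse_lsn(lsn) for lsn in reported.values())
    return INITIAL_SCORE if parse_lsn(reported[me]) == best else None
```

Note that ties are allowed: every node holding the best LSN gets the same
score, and Pacemaker picks one of them.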

Pacemaker (tries to) promote one of those.

If during monitor, you are the Master,
you bump (or keep) your master score at "9" (again, arbitrary).
(or not; maybe just keep it at "7" as it was before; changing it
may trigger a pengine run that we don't need).

If during monitor, or post-notify, you are Slave, and you see a Master,
remove your master score (because it was based on soon-to-be stale
information). You still update your LSN in attrd.
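
The two monitor rules above could look roughly like this. Again only a
sketch under assumptions: the role strings, the master_visible flag and the
use of None for "no master score" are invented for illustration.

```python
# Hypothetical sketch of the monitor-time rules; not taken from a real RA.
def score_after_monitor(role, master_visible, current_score):
    """Return the master score to keep after a monitor (None = remove it)."""
    if role == "Master":
        # Keep the score unchanged to avoid triggering a needless pengine run.
        return current_score
    if role == "Slave" and master_visible:
        # The score was based on soon-to-be stale information: drop it.
        return None
    # Slave without a visible Master: left as-is here, a separate election
    # decision is needed for that case.
    return current_score
```

The LSN update in attrd happens in all cases and is independent of this.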

To generalize my previous statement,
if during monitor (or post notify; @beekhof: do we also get post-notify
on the Slave, if the Master failed, or its host was fenced?),
you see no running master, you see k instances, and f failed instances,
where f may be 0, k+f == N, you use the k "healthy" instances to base
your "is my LSN one of the best LSN" decision on.
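
Spelled out as a sketch, with the same caveats as before: the function name,
its arguments and the None convention are assumptions for illustration, and
LSNs are taken as plain integers here.

```python
# Hypothetical sketch of the generalized rule; names are assumptions.
def failover_master_score(me, healthy_lsns, failed, n_expected,
                          master_running, score=7):
    """Decide the master score when no Master is running.

    healthy_lsns: dict node -> LSN (int) for the k healthy instances
    failed:       collection of the f failed instances
    Returns `score` if `me` has (one of) the best LSN among the healthy
    instances, else None (meaning: set no master score).
    """
    if master_running:
        return None  # a Master exists, nothing to elect
    if len(healthy_lsns) + len(failed) != n_expected:
        return None  # k + f != N: some instance is unaccounted for, wait
    if me not in healthy_lsns:
        return None  # we are not among the healthy instances
    best = max(healthy_lsns.values())
    return score if healthy_lsns[me] == best else None
```

With f == 0 and k == N this reduces to the start-up case.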

You update the master score.
Pacemaker will handle the promotion.

I really feel that using some arbitrary, constantly changing, service
specific "goodness" value as master-score directly without any
transformation is a bad idea.

> A solution we were discussing with my colleague was to be able to break the
> current transition during the pre-promote and make sure a new transition is
> computed where pre-promote is called again. This would allow an RA needing
> a complex election to have as many calls of pre-promote as needed to make a
> decision, without waiting for a "monitor" action to keep going with the
> election process.
> 
> I noticed a transient attribute update already breaks a transition, like
> crm_master does if I understand it correctly. But I'm not sure how to create
> a custom transient attribute that would break the pre-promote for sure and
> re-trigger it. Could we create a "promote-step" attribute which would be
> incremented as long as slaves are not happy with their election, re-triggering
> the pre-promote each time?

-- 
: Lars Ellenberg
: http://www.LINBIT.com | Your Way to High Availability
: DRBD, Linux-HA  and  Pacemaker support and consulting

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.