[ClusterLabs Developers] problem with master score limited to 1000000

Mon Apr 27 07:42:10 EDT 2015

On 04/27/2015 12:47 PM, Lars Ellenberg wrote:
> On Mon, Apr 27, 2015 at 10:56:58AM +0200, Jehan-Guillaume de Rorthais wrote:
>> Hi Andrew,
>>
>> On Mon, 27 Apr 2015 07:06:36 +1000
>> Andrew Beekhof <andrew at beekhof.net> wrote:
>>
>>>> On 25 Apr 2015, at 1:33 am, Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
>>>> wrote:
>>>>
>>>> We are writing a new resource agent for PostgreSQL (I am open to discussing why
>>>> off-list to keep the thread clean) and are experiencing some limitations
>>>> regarding the master scoring in Pacemaker.
>>>>
>>>> The only way in PostgreSQL to define which node should be promoted is to
>>>> compare their location in their transaction log (called LSN). This LSN is
>>>> expressed as a size that is obviously growing quickly.
>>>
>>> We can look at bumping infinity, but what value would be acceptable?
>>
>> I suppose most platforms support a value of 2^31-1 (~2 billion) as a simple
>> 4-byte signed integer. But I can see two issues with this:
>>
>>   * could it break the compatibility with other RA expecting "inf" to be
>>     1,000,000?
>>   * it just moves the limit farther, but it doesn't solve the real problem.
>>
>>
>> In our situation, 2GB would probably be good in most situations, but consider
>> this scenario:
>>
>>   * monitor interval is 10 sec
>>   * a table of 10GB is created on the master and streamed asynchronously to the
>>     slaves
>>   * the master crashes
>>
>> If at least 2GB has been streamed to the slaves, they will all have the same
>> "inf" value.
>>
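The tie described in the scenario above can be reproduced with a small sketch. It is illustrative only and is not Pacemaker code: `clamp_score`, the node names, and the byte counts are hypothetical, and it assumes that any score above the cap is treated as "inf".

```python
# Illustrative model of a capped score; not taken from Pacemaker.
CURRENT_CAP = 1_000_000    # the limit named in the thread subject
PROPOSED_CAP = 2**31 - 1   # the 4-byte signed integer limit discussed above


def clamp_score(raw, cap):
    """Clamp a raw score into the range [-cap, cap]."""
    return max(-cap, min(cap, raw))


# Hypothetical amount of WAL (in bytes) each slave received before the crash.
lsn_bytes = {
    "node-a": 2 * 1024**3,
    "node-b": 5 * 1024**3,
    "node-c": 9 * 1024**3,
}

# Every slave has received at least 2 GiB, so every score hits the cap.
clamped = {node: clamp_score(lsn, PROPOSED_CAP) for node, lsn in lsn_bytes.items()}

# The node that is really the most advanced is lost in the tie.
most_advanced = max(lsn_bytes, key=lsn_bytes.get)
```

With these inputs all three entries of `clamped` are equal, although `most_advanced` is `node-c`. Raising the cap only changes how much data must be streamed before the tie appears.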
>>> Would using "seconds since X" be an option instead?
>>
>> I don't understand what you mean. Does it apply to my problem or to the "inf"
>> consideration? Could you elaborate?
> 
> There is no reason to use your LSN as master score directly.
> 
> If I understand correctly, with your proposal of using the constantly
> changing LSN as master score directly, you hope that pacemaker will,
> in the event of Master failure, always have valid current information
> about which the best Slave would be, and failover to there.
> 
> You also know that that's not exactly true, because the information
> may be stale for "now - last-monitoring-interval", so you try to
> figure out some clever way to not wait for the next monitor interval
> of all instances, but still base the decision on the information
> that you would have had, if you did wait for it.
> 
> I think that does not work:
> you cannot base decisions on information you don't have.
> You either wait for the information (and possibly figure out a way to
> request it on demand, or report it more frequently),
> or you knowingly decide on incomplete information, and prepare to deal
> with the consequences of a potentially "wrong-in-hindsight" decision.

[snipping all of the below]

Basically what you want to do is what the galera agent does now.

Fabio

> 
> I suspect that updating (changing!) the master score that frequently would even hurt.
> 
> What I think you should do is update the LSN (or whatever value you want
> to base the decision on) "frequently enough" -- whatever that means --
> and potentially "on demand". You may consider NOT storing it in the CIB
> directly, but maybe as a non-persistent attribute in attrd.
> 
> If you only store it in attrd (and not the cib), you could update it much
> more frequently, possibly by some "daemon" or trigger you start along
> with the service.
> 
> You should start out without any master score (iirc, even a master score
> of 0 would allow promotion; only a missing master score prevents pacemaker
> from promoting).
> 
> N nodes, clone-max N, clone-node-max 1, master-max 1
> 
> During start, and monitor, you store the instance LSN in attrd.
> If you see N instances started and all LSN reported,
> if your LSN is (one of) the best LSN, set master score "7" (arbitrary).
> 
> Pacemaker (tries to) promote one of those.
> 
> If during monitor, you are the Master,
> you bump (or keep) your master score at "9" (again, arbitrary).
> (or not; maybe just keep it at "7" as it was before; changing it
> may trigger a pengine run we don't need).
>   
> If during monitor, or post-notify, you are Slave, and you see a Master,
> remove your master score (because it was based on soon-to-be stale
> information). You still update your LSN in attrd.
> 
> To generalize my previous statement,
> if during monitor (or post notify; @beekhof: do we also get post-notify
> on the Slave, if the Master failed, or its host was fenced?),
> you see no running master, you see k instances, and f failed instances,
> where f may be 0, k+f == N, you use the k "healthy" instances to base
> your "is my LSN one of the best LSN" decision on.
> 
> You update the master score.
> Pacemaker will handle the promotion.
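The election steps above can be sketched as a pure decision function. This is a minimal sketch under stated assumptions, not a resource agent: `decide_master_score` and its parameters are hypothetical names, the LSNs are assumed to be already read back from attrd as integers, and the caller is assumed to apply the result (set or remove the master score) itself.

```python
from typing import Mapping, Optional, Set

CANDIDATE_SCORE = 7  # arbitrary, as in the message above


def decide_master_score(
    me: str,
    role: str,
    reported_lsn: Mapping[str, int],
    expected_instances: int,
    failed: Optional[Set[str]] = None,
    master_present: bool = False,
) -> Optional[int]:
    """Return the master score to set, or None to remove (or not set) it."""
    failed = failed or set()

    if role == "Master":
        # Keep the score unchanged to avoid triggering a needless pengine run.
        return CANDIDATE_SCORE

    if master_present:
        # A Slave that sees a Master drops its score: it was based on
        # information that is about to become stale.
        return None

    # Only the k healthy instances take part in the comparison.
    healthy = {n: lsn for n, lsn in reported_lsn.items() if n not in failed}

    # Wait until k + f == N, i.e. every non-failed instance has reported.
    if me not in healthy or len(healthy) + len(failed) < expected_instances:
        return None

    best = max(healthy.values())
    return CANDIDATE_SCORE if healthy[me] == best else None
```

For example, with reported LSNs `{"a": 100, "b": 250, "c": 250}` and three expected instances, `b` and `c` both get score 7 and `a` gets none, so Pacemaker is left to pick one of the two best candidates. The score itself stays a small constant; only the set of nodes that carry it changes.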
> 
> I really feel that using some arbitrary, constantly changing, service
> specific "goodness" value as master-score directly without any
> transformation is a bad idea.
> 
>> A solution my colleague and I were discussing was to be able to break the
>> current transition during the pre-promote and make sure a new transition is
>> computed where pre-promote is called again. This would allow an RA needing a
>> complex election to have as many calls of pre-promote as needed to reach a
>> decision, without waiting for a "monitor" action to keep going with the
>> election process.
>>
>> I noticed a transient attribute update already breaks a transition, like
>> crm_master does if I understand it correctly. But I'm not sure how to create
>> a custom transient attribute that would reliably break the pre-promote and
>> re-trigger it. Could we create a "promote-step" attribute which would be
>> incremented as long as slaves are not happy with their election, re-triggering
>> the pre-promote each time?
>