[ClusterLabs Developers] problem with master score limited to 1000000

Andrew Beekhof andrew at beekhof.net
Mon Apr 27 23:37:05 EDT 2015


> On 27 Apr 2015, at 11:10 pm, Jehan-Guillaume de Rorthais <jgdr at dalibo.com> wrote:
> 
> On Mon, 27 Apr 2015 12:47:56 +0200
> Lars Ellenberg <lars.ellenberg at linbit.com> wrote:
> 
>> On Mon, Apr 27, 2015 at 10:56:58AM +0200, Jehan-Guillaume de Rorthais wrote:
>>> Hi Andrew,
>>> 
>>> On Mon, 27 Apr 2015 07:06:36 +1000
>>> Andrew Beekhof <andrew at beekhof.net> wrote:
>>> 
>>>>> On 25 Apr 2015, at 1:33 am, Jehan-Guillaume de Rorthais
>>>>> <jgdr at dalibo.com> wrote:
>>>>> 
>>>>> We are writing a new resource agent for PostgreSQL (I am open to
>>>>> discussing why off-list, to keep the thread clean) and are experiencing
>>>>> a limitation regarding master scoring in Pacemaker.
>>>>> 
>>>>> The only way in PostgreSQL to determine which node should be promoted
>>>>> is to compare the nodes' positions in the transaction log (called the
>>>>> LSN). The LSN is expressed as a byte position, so it obviously grows quickly.
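
(For context: on a 9.x standby the replay position can be read with
something like the following; an untested sketch, function names as of
the 9.x series.)

    # Read this standby's replay position; returns an LSN such as "0/3000060".
    lsn=$(psql -Atc "SELECT pg_last_xlog_replay_location()")

    # LSNs are byte positions, so the server can compute the distance
    # between two of them directly (pg_xlog_location_diff, 9.2+):
    psql -Atc "SELECT pg_xlog_location_diff('0/4000000', '$lsn')"
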
>>>> 
>>>> We can look at bumping infinity, but what value would be acceptable?
>>> 
>>> I suppose most platforms support a value of 2^31-1 (~2 billion) as a
>>> simple 4-byte signed integer. But I can see two issues with this:
>>> 
>>>  * it could break compatibility with other RAs expecting "inf" to be
>>>    1,000,000;
>>>  * it just moves the limit farther out, but it doesn't solve the real problem.
>>> 
>>> 
>>> In our situation, 2GB would probably be good in most situations, but
>>> consider this scenario:
>>> 
>>>  * the monitor interval is 10 sec
>>>  * a 10GB table is created on the master and streamed asynchronously to
>>>    the slaves
>>>  * the master crashes
>>> 
>>> If at least 2GB of that has been streamed to the slaves, they will all
>>> end up with the same "inf" score.
>>> 
>>>> Would using "seconds since X" be an option instead?
>>> 
>>> I don't understand what you mean. Does it apply to my problem or to the
>>> "inf" consideration? Could you elaborate?
>> 
>> There is no reason to use your LSN as master score directly.
> 
> We agree. We already gave up on this idea; that is the whole point of this
> discussion :)
> 
>> If I understand correctly, with your proposal of using the constantly
>> changing LSN as the master score directly, you hope that pacemaker will,
>> in the event of Master failure, always have valid current information
>> about which Slave would be best, and fail over to it.
> 
> Not constantly changing, no. The ideal scenario for us would be to set it
> during pre-promote **only**. After the Master fails, this LSN doesn't move
> anymore until a new master takes over and writes occur on it.
> 
> In our opinion, this should not be set during a monitor action; that is not
> its place, and it is not accurate enough.
> 
>> You also know that that's not exactly true, because the information
>> may be stale by "now - last-monitoring-interval", so you try to
>> figure out some clever way to not wait for the next monitor interval
>> of all instances, but still base the decision on the information
>> that you would have had if you had waited for it.
> 
> Nope. We actually think monitor is not the right place to handle
> promotion-related work, for the same reasons you raise and because of the
> delay it introduces between a master failure and the reaction.
> 
>> I think that does not work:
>> you cannot base decisions on information you don't have.
>> You either wait for the information (and possibly figure out a way to
>> request it on demand, or report it more frequently),
>> or you knowingly decide on incomplete information, and prepare to deal
>> with the consequences of a decision that is potentially "wrong in hindsight".
> 
> We 100% agree. Again, we want to avoid that.
> 
> In fact, this is one of the reasons we are trying to write a new RA for
> PostgreSQL: the one distributed in "resource-agents" deals with promotion
> during the monitor action if there is no master.
> 
>> I suggest that updating (changing!) the master score that frequently would
>> even hurt.
> 
> Definitely. There's no reason we would want to do that.
> 
>> What I think you should do is update the LSN (or whatever value you want
>> to base the decision on) "frequently enough" -- whatever that means --
>> and potentially "on demand". You might consider NOT storing it in the CIB
>> directly, but rather as a non-persistent attribute in attrd.
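
A minimal sketch of that with attrd_updater (the attribute name
"pgsql-lsn" is made up for illustration):

    # Publish the current LSN as a transient attribute held by attrd;
    # "-d 5s" adds dampening so rapid updates are coalesced rather
    # than written out individually.
    attrd_updater -n pgsql-lsn -U "$lsn" -d 5s
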
> 
> Agreed. Our goal and our question here were all about how to share these
> LSNs **on demand** (during pre-promote) between slaves; each RA can then
> make a decision based on the others' LSNs, claiming it can be a master by
> setting crm_master to 100, for example, or setting crm_master to 1 if it
> lags behind.
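
Something like this, presumably (untested; $my_lsn and $best_lsn are
whatever the LSN comparison produced):

    # Election outcome, as seen from one slave:
    if [ "$my_lsn" = "$best_lsn" ]; then
        crm_master -v 100   # promotion candidate
    else
        crm_master -v 1     # lagging; stay a plain slave
    fi
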
> 
>> If you only store in attrd (and not the cib), you could update it much
>> more frequently, possibly by some "daemon" or trigger you start along
>> with the service.
> 
> This is not good enough. Race conditions are always knocking at the door in
> such code. Moreover, it adds complexity, and we are trying to keep things as
> simple as possible.
> 
>> You should start out without any master score (IIRC, even a master score
>> of 0 would allow promotion; only a missing master score prevents pacemaker
>> from promoting).
> 
> Yes, we start with a score of 1.
> 
>> N nodes, clone-max N, clone-node-max 1, master-max 1
>> 
>> During start and monitor, you store the instance's LSN in attrd.
>> If you see N instances started and all LSNs reported, and your LSN is
>> (one of) the best, set the master score to "7" (arbitrary).
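
For reference, a rough sketch of that comparison (the attribute name
and the zero-padding caveat are mine; untested):

    # Gather every node's published LSN and keep the highest. A real
    # agent must zero-pad LSNs first so that string order matches
    # byte order.
    best=$(
        for node in $(crm_node -l | awk '{print $2}'); do
            crm_attribute -t status -N "$node" -n pgsql-lsn -G -q 2>/dev/null
        done | sort | tail -n 1
    )
    mine=$(crm_attribute -t status -n pgsql-lsn -G -q 2>/dev/null)

    # Only (one of) the best LSNs earns a master score.
    [ -n "$mine" ] && [ "$mine" = "$best" ] && crm_master -v 7
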
> 
> How do you know you have the best LSN? How and when do you know that all N
> nodes have set their LSN by the time you want to make your decision?
> 
> At least after a first round of pre-promote, the next time it is executed we
> know for sure that all nodes finished the first pre-promote action.
> 
>> Pacemaker (tries to) promote one of those.
>> 
>> If during monitor, you are the Master,
>> you bump (or keep) your master score at "9" (again, arbitrary).
>> (Or not; maybe just keep it at "7" as it was before; changing it
>> may trigger a pengine run we don't need.)
>> 
>> If during monitor, or post-notify, you are Slave, and you see a Master,
>> remove your master score (because it was based on soon-to-be stale
>> information). You still update your LSN in attrd.
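
Clearing the score from the agent is a one-liner:

    crm_master -D   # delete this node's master score

while the LSN keeps being published via attrd_updater as above.
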
>> 
>> To generalize my previous statement:
>> if during monitor (or post-notify; @beekhof: do we also get post-notify
>> on the Slave if the Master failed, or its host was fenced?)
>> you see no running master, you see k instances and f failed instances,
>> where f may be 0 and k+f == N, you use the k "healthy" instances to base
>> your "is my LSN one of the best LSNs" decision on.
> 
> Well, this makes the promotion happen on a monitor action, which sounds like
> the wrong place to make such a decision.
> 
> We really want to make this decision in the pre-promote action, which seems
> like the only relevant place for it.
> 
> Users are not supposed to know the implementation details of the RA. They
> should not have to know that a low monitor interval for slaves will make their
> failover faster.
> 
>> You update the master score.
>> Pacemaker will handle the promotion.
>> 
>> I really feel that using some arbitrary, constantly changing,
>> service-specific "goodness" value as the master score directly, without
>> any transformation, is a bad idea.
> 
> We agree with that.
> 
>>> A solution we were discussing with my colleague was to be able to break
>>> the current transition during pre-promote and make sure a new transition
>>> is computed in which pre-promote is called again.

Realistically, this is not going to happen in the next few years.

Regardless of the idea’s merits, it's a major change to one of our core assumptions.
Beyond the initial implementation, the fallout will last for months and I just don’t have that kind of bandwidth.

The idea is that by doing it in the monitor[1] op, you ensure you’re always in a position to do a promotion.
By all means query attrd from the promote and/or pre-promote operations to ensure that the chosen node is still the correct one though.

Give the pre-promote a decent timeout and it can also act as your "waiting for writes to come in and all LSNs to be updated" buffer.
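
Roughly (untested; the attribute name is invented, and
OCF_RESKEY_CRM_meta_notify_active_uname is the peer list pacemaker
passes to clone notifications):

    # In the pre-promote notification: wait, bounded by the operation
    # timeout, until every active peer has published an LSN.
    for peer in $OCF_RESKEY_CRM_meta_notify_active_uname; do
        until crm_attribute -t status -N "$peer" -n pgsql-lsn -G -q \
                >/dev/null 2>&1; do
            sleep 1
        done
    done
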


[1] Strictly speaking, it could be any action name you dream up and tell the cluster to call on a recurring basis.
    Given that monitor is already defined and being called repeatedly, most people take the path of least resistance and use that (one less thing for an admin to mess up).
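
With the crm shell that could look something like this (the action name
"publish-lsn" and the agent name are invented; the RA would have to
advertise and implement the action):

    primitive pgsqld ocf:custom:pgsqlms \
        op monitor interval=10s \
        op publish-lsn interval=5s timeout=20s
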


>>> This would allow RAs needing a
>>> complex election to have as many calls of pre-promote as needed to make a
>>> decision, without waiting for a "monitor" action to keep the election
>>> process going.
>>> 
>>> I noticed that a transient attribute update already breaks a transition,
>>> like crm_master does, if I understand it correctly. But I'm not sure how
>>> to create a custom transient attribute that would reliably break the
>>> pre-promote and re-trigger it. Could we create a "promote-step" attribute
>>> which would be incremented as long as slaves are not happy with their
>>> election, re-triggering the pre-promote each time?
> 
> What we need after a master fails is a guarantee that every node has set its
> LSN attribute; then each of them can decide whether it can be a master or not.
> 
> What I tried to describe above would allow us to create such scenario:
> 
>  1. the master fails
>  2. Pacemaker tries to promote one slave at random (they all have a master
>     score of 1)
>  3. during the pre-promote, each slave sets its LSN somewhere (using
>     crm_attribute)
>  4. ...and breaks the current transition (using what I called the
>     promote-step attribute)
>  5. a new transition is computed which triggers a pre-promote
>  6. each slave compares its LSN to the others' set in step 3
>  7. the ones that have the highest LSN set their master score to 100
> 
> So the first call of pre-promote covers steps 3 and 4, and the second call
> covers steps 6 and 7.
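
A compressed sketch of that two-phase handler (attribute names, the
step counter and the structure are all illustrative, and untested):

    notify_pre_promote() {
        step=$(crm_attribute -t status -n promote-step -G -q 2>/dev/null)

        if [ -z "$step" ]; then
            # Steps 3 + 4: publish our LSN, then bump the step counter;
            # the transient-attribute change aborts the current
            # transition, so pre-promote runs again.
            lsn=$(psql -Atc "SELECT pg_last_xlog_replay_location()")
            crm_attribute -t status -n pgsql-lsn -v "$lsn"
            crm_attribute -t status -n promote-step -v 1
            return
        fi

        # Steps 6 + 7: by now every node has published; compare
        # (again assuming LSNs are zero-padded so string order works).
        best=$(
            for node in $(crm_node -l | awk '{print $2}'); do
                crm_attribute -t status -N "$node" -n pgsql-lsn -G -q \
                    2>/dev/null
            done | sort | tail -n 1
        )
        mine=$(crm_attribute -t status -n pgsql-lsn -G -q)
        if [ "$mine" = "$best" ]; then crm_master -v 100; else crm_master -v 1; fi
        crm_attribute -t status -n promote-step -D   # reset for next time
    }
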
> 
> 
> -- 
> Jehan-Guillaume de Rorthais
> Dalibo
> http://www.dalibo.com
> 