[Pacemaker] Multi-site support in pacemaker (tokens, deadman, CTR)

Andrew Beekhof andrew at beekhof.net
Tue May 3 02:28:06 EDT 2011


On Mon, May 2, 2011 at 10:26 PM, Lars Marowsky-Bree <lmb at novell.com> wrote:
> On 2011-04-29T10:32:25, Andrew Beekhof <andrew at beekhof.net> wrote:
>
>> With such a long email, assume agreement for anything I don't
>> explicitly complain about :-)
>
> Sorry :-) I'm actually trying to write this up into a somewhat more
> consistent document just now, which turns out to be surprisingly hard
> ... Not that easily structured. I assume anything is better than nothing
> though.
>
>> > It's an excellent question where the configuration of the Cluster Token
>> > Registry would reside; I'd assume that there would be a
>> > resource/primitive/clone (design not finished) that corresponds to the
>> > daemon instance,
>> A resource or another daemon like crmd/cib/etc?
>> Could go either way I guess.
>
> Part of my goal is to have this as an add-on on top of Pacemaker.
> Ideally, short of the few PE/CIB enhancements, I'd love it if Pacemaker
> wouldn't even have to know about this.
>
> The tickets clearly can only be acquired if the rest of the cluster is
> up already, so having this as a clone makes some sense, and provides
> some monitoring of the service itself. (Similar to how ocfs2_controld is
> managed.)

Yep, not disagreeing.
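
For illustration, the CTR daemon could be wrapped in a clone along these
lines; the agent name (ocf:pacemaker:ctrd) and its "peers" parameter are
purely made up here, since no such agent exists yet:

  <clone id="ctr-clone">
    <primitive id="ctr" class="ocf" provider="pacemaker" type="ctrd">
      <instance_attributes id="ctr-params">
        <!-- hypothetical parameter: the other CTR instances to talk to -->
        <nvpair id="ctr-peers" name="peers" value="site-a site-b arbitrator"/>
      </instance_attributes>
      <operations>
        <!-- monitoring the daemon is part of the point of making it a clone -->
        <op id="ctr-monitor-10s" name="monitor" interval="10s"/>
      </operations>
    </primitive>
  </clone>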

>
>> > (I think the word works - you can own a ticket, grant a ticket, cancel,
>> > and revoke tickets ...)
>> Maybe.
>> I think token is still valid though.  It's not as if only one project
>> in the world uses heartbeats either.
>> (Naming a project after a generic term is another matter).
>
> I have had multiple people confused by the "token" word in the CTR and
> corosync contexts already. I just wanted to suggest killing it as
> early as possible if we can ;-)

Shrug.  I guess whoever writes it gets to choose :-)

>> > Site-internal partitioning is handled at exactly that level; only the
>> > winning/quorate partition will be running the CTR daemon and
>> > re-establish communication with the other CTR instances. It will fence
>> > the losers.
>>
>> Ah, so that's why you suggested it be a resource.
>
> Yes.
>
>> Question though... what about no-quorum-policy=ignore ?
>
> That was implicit somewhere later on, I think. The CTR must be able to
> cope with multiple partitions of the same site, and would only grant the
> ticket to one of them.

But you'll still have a (longer) window during which both partitions
think they own the token.
Potentially long enough to begin starting resources.
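
(For reference, the option in question is the cluster property set in
crm_config; with "ignore", every partition keeps running resources, which
is exactly what opens that window:)

  <crm_config>
    <cluster_property_set id="cib-bootstrap-options">
      <nvpair id="opt-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
    </cluster_property_set>
  </crm_config>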

>
>> > Probably it makes sense to add a layer of protection here to the CTR,
>> > though - if several partitions from the same site connect (which could,
>> > conceivably, happen), the CTRs will grant the ticket(s) only to the
>> > partition with the highest node count (or, should these be equal,
>> > lowest nodeid),
>> How about longest uptime instead?  Possibly too variable?
>
> That would work too, this was just to illustrate that there needs to be
> a unique tie-breaker of last resort that is guaranteed to break said
> tie.
>
>> >> Additionally, when a split-brain happens, what about the existing
>> >> stonith mechanism? Should the partition without quorum be stonithed?
>> > Yes, just as before.
>> Wouldn't that depend on whether a deadman constraint existed for one
>> of the lost tickets?
>
> Well, like I said: just as before. We don't have to STONITH anything if
> we know that the nodes are clean. But, as it happens, we still do, since
> we don't trust nodes that failed. So unless we change the algorithm, the
> partitions would get shot already, and there's nothing wrong with that ...
> Put differently: the CTR doesn't require any change of behaviour here.

I'm not arguing with that; I'm just saying I don't think we need an
additional construct.

>> Isn't kind=deadman for ordering constraints redundant though?
>
> It's not required for this approach, as far as I can see, since this
> only needs it for the ticket dependencies. I don't really care what else
> it gets added to ;-)

I do :-)
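
For the record, the shape being kicked around was roughly the following,
extending rsc_order's existing kind= attribute; this is only a sketch of
the idea, none of the syntax is settled:

  <constraints>
    <!-- hypothetical: if rscA is lost, nodes still running rscB are
         fenced rather than waited on for a clean stop -->
    <rsc_order id="order-deadman" first="rscA" then="rscB" kind="Deadman"/>
  </constraints>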

>
>> > Andrew, Yan - do you think we should allow _values_ for tickets, or
>> > should they be strictly defined/undefined/set/unset?
>> Unclear.  It might be nice to store the expiration (and/or last grant)
>> time in there for admin tools to do something with.
>> But that could mean a lot of spurious CIB updates, so maybe it's better
>> to build that into the ticket daemon's API.
>
> I think sometime later in the discussion I actually made a case for
> certain values.
>
>> > The ticket not being set/defined should be identical to the ticket being
>> > set to "false/no", as far as I can see - in either case, the ticket is
>> > not owned, so all resources associated with it _must_ be stopped, and
>> > may not be started again.
>> There is a startup issue though.
>> You don't want to go fencing yourself before you can start the daemon
>> and attempt to get the token.
>>
>> But the fencing logic would presumably only happen if you DON'T have
>> the ticket but DO have an affected resource active.
>
> Right. If nothing that depends on the ticket you don't hold is actually
> running, nothing happens.
>
> So no start-up issue - unless someone has misconfigured ticket-protected
> resources to be started outside the scope of Pacemaker, but that's
> deserved then ;-)

Yep. Just calling it out so that we have it written down somewhere.

>> > Good question. This came up above already briefly ...
>> >
>> > I _think_ there should be a special value that a ticket can be set to
>> > that doesn't fence, but stops everything cleanly.
>>
>> Again, wouldn't fencing only happen if a deadman dep made use of the ticket?
>
> Right, all of the above assumed that resources which depend on the
> ticket were actually active. Otherwise, one wouldn't know which nodes to
> fence for this anyway.
>
>> Otherwise we probably want:
>>    <token id=... loss-policy=(fence|stop|freeze) granted=(true|false) />
>>
>> with the daemon only updating the "granted" field.
>
> Yeah. What I wanted to hint at above though was an
> owned-policy=(start|stop) to allow admins to cleanly stop the services
> even while still owning the ticket - and still be able to recover from a
> revocation properly (i.e., still fencing active resources).
>
>> > (Tangent - ownership appears to belong in the status section; the value
>> > seems to belong in the cib->ticket section(?).)
>> Plausible - since you'd not want nodes to come up and think they have tickets.
>> That would also negate my concern about including the expiration time
>> in the ticket.
>
> Right. One thing that ties into this here is the "how do tickets expire
> if the CTR dies on us", since then no one is around to revoke it from the
> CIB.
>
> I thought about handling this in the LRM, CIB, or PE (via the recheck
> interval), but they all suck. The cleanest and most reliable way seems
> to be to make death-of-ctr fatal for the nodes - just like
> ocfs2_controld or sbd via the watchdog.

agree
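
Pulling the proposal above together, the configuration side might then end
up looking something like this - attribute names are only a sketch, with
owned-policy being the suggested addition, and ownership itself kept out
of the configuration section per the tangent above:

  <configuration>
    <tokens>
      <!-- loss-policy: what happens to dependent resources when the token
           goes away; owned-policy: lets an admin stop them cleanly while
           still holding the token -->
      <token id="ticket-A" loss-policy="fence" owned-policy="start"/>
    </tokens>
  </configuration>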

> But storing the acquisition time in the CIB probably is quite useful for
> the tools. I assume that typically we'll have <5 tickets around; an
> additional time stamp won't hurt us.

yep
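
i.e. something like this in the status section, with the CTR daemon as the
only writer of "granted" and the timestamp (again, names are illustrative
only):

  <status>
    <tokens>
      <!-- granted flips when the CTR acquires/loses the token;
           last-granted records the acquisition time for the tools -->
      <token id="ticket-A" granted="true" last-granted="1304410086"/>
    </tokens>
  </status>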



