[Pacemaker] Multi-site support in pacemaker (tokens, deadman, CTR)

Lars Marowsky-Bree lmb at novell.com
Mon May 2 16:26:59 EDT 2011


On 2011-04-29T10:32:25, Andrew Beekhof <andrew at beekhof.net> wrote:

> With such a long email, assume agreement for anything I don't
> explicitly complain about :-)

Sorry :-) I'm actually trying to write this up into a somewhat more
consistent document just now, which turns out to be surprisingly hard
... it's not that easily structured. I assume anything is better than
nothing, though.

> > It's an excellent question where the configuration of the Cluster Token
> > Registry would reside; I'd assume that there would be a
> > resource/primitive/clone (design not finished) that corresponds to the
> > daemon instance,
> A resource or another daemon like crmd/cib/etc?
> Could go either way I guess.

Part of my goal is to have this as an add-on on top of Pacemaker.
Ideally, aside from the few PE/CIB enhancements, I'd love it if Pacemaker
didn't even have to know about this.

The tickets clearly can only be acquired if the rest of the cluster is
up already, so having this as a clone makes some sense, and provides
some monitoring of the service itself. (Similar to how ocfs2_controld is
managed.)
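
To make that a bit more concrete, something along these lines is what I
have in mind - purely a sketch, and the resource agent name
"ocf:custom:ctrd" is made up for illustration:

    <!-- Sketch only: "ocf:custom:ctrd" is a made-up agent name -->
    <clone id="ctr-clone">
      <primitive id="ctr-daemon" class="ocf" provider="custom" type="ctrd">
        <operations>
          <!-- regular monitoring of the CTR daemon itself -->
          <op id="ctr-daemon-monitor" name="monitor" interval="30s"/>
        </operations>
      </primitive>
    </clone>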

> > (I think the word works - you can own a ticket, grant a ticket, cancel,
> > and revoke tickets ...)
> Maybe.
> I think token is still valid though.  Not like only one project in the
> world uses heartbeats either.
> (Naming a project after a generic term is another matter).

I have had multiple people confused by the word "token" in the CTR and
corosync contexts already, so I just wanted to suggest killing it as
early as possible if we can ;-)

> > Site-internal partitioning is handled at exactly that level; only the
> > winning/quorate partition will be running the CTR daemon and
> > re-establish communication with the other CTR instances. It will fence
> > the losers.
> 
> Ah, so thats why you suggested it be a resource.

Yes.

> Question though... what about no-quorum-policy=ignore ?

That was implicit somewhere later on, I think. The CTR must be able to
cope with multiple partitions from the same site, and would only grant the
ticket to one of them.

> > Probably it makes sense to add a layer of protection here to the CTR,
> > though - if several partitions from the same site connect (which could,
> > conceivably, happen), the CTRs will grant the ticket(s) only to the
> > partition with the highest node count (or, should these be equal,
> > lowest nodeid),
> How about longest uptime instead?  Possibly too variable?

That would work too; this was just to illustrate that there needs to be
a unique tie-breaker of last resort that is guaranteed to break said
tie.

> >> Additionally, when a split-brain happens, how about the existing
> >> stonith mechanism. Should the partition without quorum be stonithed?
> > Yes, just as before.
> Wouldn't that depend on whether a deadman constraint existed for one
> of the lost tickets?

Well, like I said: just as before. We wouldn't have to STONITH anything
if we knew the nodes were clean - but as it stands we still do, since we
don't trust nodes that have failed. So unless we change the algorithm, the
partitions would get shot already, and there's nothing wrong with that.
Put differently: the CTR doesn't require any change of behaviour here.

> Isn't kind=deadman for ordering constraints redundant though?

It's not required for this approach, as far as I can see, since this
only needs it for the ticket dependencies. I don't really care what else
it gets added to ;-)

> > Andrew, Yan - do you think we should allow _values_ for tickets, or
> > should they be strictly defined/undefined/set/unset?
> Unclear.  It might be nice to store the expiration (and/or last grant)
> time in there for admin tools to do something with.
> But that could mean a lot of spurious CIB updates, so maybe its better
> to build that into the ticket daemon's api.

I think sometime later in the discussion I actually made a case for
certain values.

> > The ticket not being set/defined should be identical to the ticket being
> > set to "false/no", as far as I can see - in either case, the ticket is
> > not owned, so all resources associated with it _must_ be stopped, and
> > may not be started again.
> There is a startup issue though.
> You don't want to go fencing yourself before you can start the daemon
> and attempt to get the token.
> 
> But the fencing logic would presumably only happen if you DONT have
> the ticket but DO have an affected resource active.

Right. If nothing that depends on a ticket you don't hold is active,
nothing happens.

So there is no start-up issue - unless someone has misconfigured
ticket-protected resources to be started outside the scope of Pacemaker,
but then that's deserved ;-)
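
For illustration, the kind of dependency I have in mind would look
roughly like the constraint below - element and attribute names are just
a sketch, nothing final. Only if "db" were still active without the
ticket being granted would the cluster have to stop or fence anything:

    <!-- Sketch only: names are illustrative, not final syntax -->
    <constraints>
      <rsc_ticket id="db-requires-ticketA" rsc="db" ticket="ticketA"/>
    </constraints>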

> > Good question. This came up above already briefly ...
> >
> > I _think_ there should be a special value that a ticket can be set to
> > that doesn't fence, but stops everything cleanly.
> 
> Again, wouldn't fencing only happen if a deadman dep made use of the ticket?

Right, all of the above assumed that one actually has resources active
that depend on the ticket. Otherwise, one wouldn't know which nodes to
fence for this anyway.

> Otherwise we probably want:
>    <token id=... loss-policy=(fence|stop|freeze) granted=(true|false) />
> 
> with the daemon only updating the "granted" field.

Yeah. What I wanted to hint at above, though, was an
owned-policy=(start|stop) to allow admins to stop the services cleanly
even while still owning the ticket - and still be able to recover
properly from a revocation (i.e., still fencing active resources).
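
Reusing the element from your example, that would look roughly like this
(again, only a sketch):

    <!-- Sketch: the quoted example plus the proposed owned-policy attribute -->
    <token id="ticketA" loss-policy="fence" owned-policy="stop" granted="true"/>

With owned-policy="stop", the dependent resources get stopped cleanly
even though we still own the ticket; if the ticket is later revoked while
something is still active, loss-policy="fence" applies as usual.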

> > (Tangent - ownership appears to belong to the status section; the value
> > seems belongs to the cib->ticket section(?).)
> Plausible - since you'd not want nodes to come up and think they have tickets.
> That would also negate my concern about including the expiration time
> in the ticket.

Right. One thing that ties into this is the question of how tickets
expire if the CTR dies on us, since then no one is around to revoke them
from the CIB.

I thought about handling this in the LRM, CIB, or PE (via the recheck
interval), but they all suck. The cleanest and most reliable way seems
to be to make death-of-ctr fatal for the nodes - just like
ocfs2_controld or sbd via the watchdog.

But storing the acquisition time in the CIB is probably quite useful for
the tools. I assume that typically we'll have fewer than 5 tickets around;
an additional time stamp won't hurt us.
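
For instance, a status-section entry along these lines would be enough
for the tools - attribute names are purely illustrative:

    <!-- Sketch only: a possible status-section representation -->
    <tickets>
      <!-- last-granted: time of acquisition as a Unix timestamp -->
      <ticket_state id="ticketA" granted="true" last-granted="1304367619"/>
    </tickets>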


Regards,
    Lars

-- 
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde




