[Pacemaker] Multi-site support in pacemaker (tokens, deadman, CTR)

Fri Apr 29 04:32:25 EDT 2011

With such a long email, assume agreement for anything I don't
explicitly complain about :-)

On Wed, Apr 27, 2011 at 8:55 PM, Lars Marowsky-Bree <lmb at novell.com> wrote:
> On 2011-04-26T23:34:16, Yan Gao <ygao at novell.com> wrote:
>
> Hi Yan,
>
> thanks for the good questions, let's get a discussion started!
>
>> >Introduction: At LPC 2010, we discussed (once more) that a key feature
>> >for pacemaker in 2011 would be improved support for multi-site clusters;
>> >by multi-site, we mean two (or more) sites with a local cluster each,
>> Would the topology of such a multi-site deployment be indicated in
>> cib configuration?
>
> So, the CIB would have the resource definitions, their dependencies,
> token state (just like node attribute/value pairs) - as the PE needs
> this to decide what to start/stop/fence locally.
>
> It's an excellent question where the configuration of the Cluster Token
> Registry would reside; I'd assume that there would be a
> resource/primitive/clone (design not finished) that corresponds to the
> daemon instance,

A resource or another daemon like crmd/cib/etc?
Could go either way I guess.

> but that some configuration files would need to be
> external - among other things, the SSL certificates for its TCP/SCTP
> connections to other sites.
>
> So the topology itself would not be exposed in the CIB.
>
>> Or it's just something corosync would need to care about?
>
> No - this is unrelated to corosync. As far as we - or I, at least -
> plan this so far, each site would essentially be an independent local
> cluster.
>
> The daemons to be written would basically form an overlay network on top
> via a non-latency-sensitive protocol such as TCP/SCTP; granting/revoking
> permission to run specific services by granting/revoking the token in
> the CIB; the PE would then perform the fencing/stop/start actions.
>
> Perhaps choosing the name "token" for the cluster-wide attributes was not
> a wise move, as it does evoke the "token" association from
> corosync/totem.
>
> What do you all think about switching this word to "ticket"? And have
> the Cluster Ticket Registry manage them? Less confusion later on, I
> think.
>
> I'll try the word "ticket" for the rest of the mail and we can see how
> that works out ;-)
>
> (I think the word works - you can own a ticket, grant a ticket, cancel,
> and revoke tickets ...)

Maybe.
I think token is still valid though.  Not like only one project in the
world uses heartbeats either.
(Naming a project after a generic term is another matter).

>> And the cibs between different sites would still be synchronized?
>
> The idea is that there would be - perhaps as part of the CTR daemon - a
> process that would replicate (manually triggered, periodically, or
> automatically) the configuration details of resources associated with a
> given ticket (which are easily determined since they depend on it) to
> the other sites that are eligible for the ticket.
>
> Initially, I'd be quite happy if there was a "replicate now" button to
> push or script to call - admins may actually have good reasons not to
> immediately replicate everywhere, anyway.

Agreed.  Automation can happen further down the track if there is
sufficient demand.

>
> It's conceivable that there would need to be some mangling as
> configuration is replicated; e.g., path names and IP addresses may be
> different. We _could_ express this using our CIB syntax already
> (instance attribute sets take rules, and it'd probably be easy enough
> to extend this matching to select on ticket ownership),

This sounded frightening the first time I read it, but I think I'm
getting used to the idea.

> and perhaps that
> is good enough, since I'd imagine there would actually be quite little
> to modify.
>
> (Having many differences would make the configuration very complex to
> manage and understand;

Exactly.

> hence, we want a syntax that makes it easy to
> have a few different values, and annoying to have many ;-)
>
>> In other words, normally there would be only one DC among the sites,
>> right?
>
> Each site would still have its own "DC" at the Pacemaker level. However,
> yes, the overlay cluster is - at least for now - active/passive, so only
> one site "owns" a given ticket at any given time.
>
> That doesn't mean that other sites are completely dormant, though - they
> could own other tickets (sites may own more than one), or run services
> that are associated with _not_ owning the ticket; e.g., the ticket owner
> runs the DRBD master, the site without it runs a DRBD replica.
>
>> >"Tokens" are, essentially, cluster-wide attributes (similar to node
>> >attributes, just for the whole partition).
>> Specifically, a "<tokens>" section with an attribute set (
>> "<token_set>" or something) under "/cib/configuration"?
>
> Yes; a ticket section, just like that.
>
>> Should an admin grant a token to the cluster initially?
>
> In the "easiest" case, the agent that grants or cancels a ticket would
> indeed be the admin, yes.
>
>> Or grant it to several nodes which are supposed to be from the same
>> site?
>
> The ticket is cluster-wide, and a corosync/Pacemaker cluster is one
> site, so the admin cannot grant a ticket to a subset of nodes.
>
> If the admin doesn't want some nodes to run the related resources, they
> can use rsc_location rules to set that as before.
>
>> Or grant it to a partition after a split-brain happens --  A
>> split-brain can happen between the sites or inside a site. How could
>> it be distinguished and what policies to handle the scenarios
>> respectively? What if a partition split further?
>
> Site-split is handled and arbitrated either manually - the admin revokes
> the ticket from the losing site, waits until all services are stopped,
> and grants it to the winner - or automatically (same process, but
> determined on site majority by the CTR).
>
> Site-internal partitioning is handled at exactly that level; only the
> winning/quorate partition will be running the CTR daemon and
> re-establish communication with the other CTR instances. It will fence
> the losers.

Ah, so that's why you suggested it be a resource.
Question though... what about no-quorum-policy=ignore?

> Probably it makes sense to add a layer of protection here to the CTR,
> though - if several partitions from the same site connect (which could,
> conceivably, happen), the CTRs will grant the ticket(s) only to the
> partition with the highest node count (or, should these be equal,
> lowest nodeid),

How about longest uptime instead?  Possibly too variable?

> and immediately revoke it from any other partition.
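To make the proposed tie-break concrete, here is a minimal Python sketch of the rule as quoted: highest node count wins, and on a tie the lowest nodeid wins. The `Partition` type and both function names are hypothetical, and I am assuming "lowest nodeid" means the smallest nodeid among a partition's members; nothing here is existing CTR code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    lowest_nodeid: int  # assumption: smallest nodeid among the members
    node_count: int     # number of nodes currently in this partition


def pick_winner(partitions):
    """Return the partition that keeps the ticket(s).

    Highest node count wins; equal counts fall back to the lowest nodeid.
    """
    if not partitions:
        raise ValueError("no partitions connected")
    return min(partitions, key=lambda p: (-p.node_count, p.lowest_nodeid))


def partitions_to_revoke(partitions):
    """Return every partition that must lose the ticket(s) immediately."""
    winner = pick_winner(partitions)
    return [p for p in partitions if p is not winner]
```

Swapping in "longest uptime" would only change the sort key, which is part of why the choice of criterion can be deferred.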
>
>> Additionally, when a split-brain happens, how about the existing
>> stonith mechanism. Should the partition without quorum be stonithed?
>
> Yes, just as before.

Wouldn't that depend on whether a deadman constraint existed for one
of the lost tickets?

>
>> If it shouldn't, or couldn't, should the partition elect a DC? What
>> about the no-quorum-policy?
>
> All just as before.
>
>> >Via dependencies (similar to
>> >rsc_location), one can specify that certain resources require a specific
>> >token to be set before being started
>> Which way do you prefer? I found you discussed this in another
>> thread last year. The choices mentioned there were:
>> - A "<rsc_order>" with "Deadman" order-type specified:
>>   <rsc_order id="order-tokenA-rscX" first-token="tokenA" then="rscX"
>> kind="Deadman"/>
>> - A "<rsc_colocation>":
>>   <rsc_colocation id="rscX-with-tokenA" rsc="rscX"
>> with-token="tokenA" kind="Deadman"/>
>
> These probably would make sense, but are not the primary focus. I can see
> the 'deadman' switch to be added to all constraints (to define the
> behaviour should the target no longer be in the desired state), but the
> ticket is essentially a cluster-wide equivalent of a node attribute;
> hence, reusing the order or colocation rules doesn't seem fitting.

I think it's OK.  I'd probably prefer it.
Isn't kind=deadman for ordering constraints redundant though?

>
>> - There could be a "requires" field in an "op", which could be set
>> to "quorum" or "fencing". Similarly, we could also introduce a
>> "requires-token" field:
>>
>> <op id="rscX-start" name="start" interval="0" requires-token="tokenA"/>
>>
>> The shortcoming is a resource cannot depend on multiple tokens.
>
> I don't think this works.

Agree.

>
>> - A completely new type of constraint:
>>   <rsc_token id="rscX-with-tokenA" rsc="rscX" token="tokenA"
>> kind="Deadman"/>
>
> Personally, I lean towards this. (Andrew has expressed a wish to do
> without the "rsc_" prefix, so let's drop this ;-)
>
> Not sure the kind="Deadman" is actually required, but it probably makes
> sense to be able to switch off the big hammer for debugging purposes.
> ;-)
>
> I don't see why any resource would depend on several tickets; but I can
> see a use case for wanting to depend on _not_ owning a ticket, similar
> to the node attributes. And the resource would need a role, obviously.
>
> Andrew, Yan - do you think we should allow _values_ for tickets, or
> should they be strictly defined/undefined/set/unset?

Unclear.  It might be nice to store the expiration (and/or last grant)
time in there for admin tools to do something with.
But that could mean a lot of spurious CIB updates, so maybe it's better
to build that into the ticket daemon's API.

>> >The token thus would be similar to a "site quorum"; i.e., the permission
>> >to manage/own resources associated with that site, which would be
>> >recorded in a rsc dependency. (It'd probably make a lot of sense if this
>> >would support resource sets,
>> If so, the "op" and the current "rsc_location" are not preferred.
>> >so one can easily list all the resources;
>> >also, some resources like m/s may tie their role to token ownership.)
>
> Right.
>
>> >Another aspect to site fail-over is recovery speed. A site can only
>> >activate the resources safely if it can be sure that the other site has
>> >deactivated them. Waiting for them to shutdown "cleanly" could incur
>> >very high latency (think "cascaded stop delays"). So, it would be
>> >desirable if this could be short-circuited. The idea between Andrew and
>> >myself was to introduce the concept of a "dead man" dependency; if the
>> >origin goes away, nodes which host dependent resources are fenced,
>> >immensely speeding up recovery.
>> Does the "origin" mean "token"?
>
> In the context of deadman dependencies on tickets, yes, the ticket would
> be the "origin". Andrew and I had discussed the "deadman" concept as a
> more general form of escalating the stop sequence, not just for tickets.
>
>> If so, isn't it supposed to be revoked manually by default? So the
>> short-circuited fail-over needs an admin to participate?
>
> No to both; it can be revoked manually, yes, but it isn't going to be
> always the case. I'm also not quite sure I understand where this
> question is headed; how does it matter here whether the ticket is
> revoked manually or not?
>
>> BTW, Xinwei once suggested to treat "the token is not set" and "the
>> token is set to no" differently. For the former, the behavior would
>> be like the token dependencies don't exist. If the token is
>> explicitly set, invoke the appropriate policies. Does that help to
>> distinguish scenarios?
>
> The ticket not being set/defined should be identical to the ticket being
> set to "false/no", as far as I can see - in either case, the ticket is
> not owned, so all resources associated with it _must_ be stopped, and
> may not be started again.

There is a startup issue though.
You don't want to go fencing yourself before you can start the daemon
and attempt to get the token.

But the fencing logic would presumably only happen if you DON'T have
the ticket but DO have an affected resource active.
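A minimal Python sketch of that condition, assuming the ticket state is a tri-state (`True`, `False`, or `None` for "never set") and that we know which resources carry a deadman dependency on the ticket. The function and parameter names are hypothetical; this is not PE code.

```python
def needs_fencing(ticket_granted, active_resources, deadman_resources):
    """Decide whether the local nodes must be fenced for one ticket.

    ticket_granted: True, False, or None (None = ticket never set, which
        is treated exactly like False, as Lars suggests above).
    active_resources: names of resources currently active at this site.
    deadman_resources: names of resources with a deadman dependency on
        the ticket.
    """
    if ticket_granted:
        return False
    # No ticket: fence only if an affected resource is actually running.
    return any(rsc in deadman_resources for rsc in active_resources)
```

With nothing active at startup this returns `False`, so a freshly booted site would not fence itself before the daemon has had a chance to request the ticket.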

>
>> Does it mean an option for users to choose if they want an
>> immediate fencing or stopping the resources normally? Is it global
>> or particularly for a specific token, or even/just for a specific
>> dependency?
>
> Good question. This came up above already briefly ...
>
> I _think_ there should be a special value that a ticket can be set to
> that doesn't fence, but stops everything cleanly.

Again, wouldn't fencing only happen if a deadman dep made use of the ticket?

Otherwise we probably want:
   <token id=... loss-policy=(fence|stop|freeze) granted=(true|false) />

with the daemon only updating the "granted" field.
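As a rough illustration of how such an element might be consumed, here is a Python sketch that parses one `<token>` element and maps it to an action. The element and attribute names come from the line above; the defaults (`granted="false"`, `loss-policy="stop"`), the return strings, and the function name are my assumptions, not an agreed schema.

```python
import xml.etree.ElementTree as ET

LOSS_POLICIES = {"fence", "stop", "freeze"}


def action_on_ticket(xml_text, has_active_resources):
    """Map a <token .../> element to what the site should do.

    Returns "run" when the ticket is granted, "none" when it is not
    granted but nothing dependent is active, else the loss-policy value.
    """
    elem = ET.fromstring(xml_text)
    # Assumed defaults: an absent attribute means not granted / clean stop.
    granted = elem.get("granted", "false") == "true"
    policy = elem.get("loss-policy", "stop")
    if policy not in LOSS_POLICIES:
        raise ValueError("unknown loss-policy: %s" % policy)
    if granted:
        return "run"
    if not has_active_resources:
        return "none"
    return policy
```

The daemon would only ever flip `granted`; `loss-policy` stays under the admin's control.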

>
> However, while the ticket is in this state, the site _still_ owns it (no
> other site can get it yet, and were it to lose the ticket due to
> expiration, it'd still need to fence all remaining nodes so that the
> services can be started elsewhere).
>
> Perhaps the CTR doesn't even need to know about this - it's a special
> setting of the ticket at a given site. Perhaps it makes sense to
> distinguish between owning the ticket (as granted on request via the CTR
> or manually), and its value (which is set locally)? perhaps:

I think I need to hear some responses to questions above before I
comment on the below.

>
> Ownership is a true/false flag. Value is a non-negative integer (that
> is, 0 is allowed).
>
> A site that "owns" a ticket of value 0 will stop resources cleanly, and
> afterwards relinquish the ticket itself.
>
> A site that "owns" a ticket of any value and loses it will perform the
> deadman dance.
>
> A site that does not own a ticket but has a non-zero value for it
> defined will request the ticket from the CTR; the CTR will grant it to
> the site with the highest bid (but not to a site with 0) (if these are
> equal, to the site with the highest node count, if these again are
> equal, to the site with the lowest nodeid).
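The three rules quoted above can be sketched in Python as follows. This is only an illustration of the proposal as written: the `Site` type, the `lost` flag, the returned action strings, and both function names are hypothetical, and I am assuming one representative nodeid per site for the final tie-break.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    value: int       # locally set ticket value (the "bid"), must be >= 0
    node_count: int  # nodes at the site, first tie-break
    nodeid: int      # assumption: one representative nodeid per site


def site_action(owned, lost, value):
    """What a site does locally for one ticket.

    owned: the site currently holds the ticket.
    lost:  the site held the ticket and it was just taken away or expired.
    """
    if value < 0:
        raise ValueError("ticket value must not be negative")
    if lost:
        return "deadman"               # any value: fence dependents
    if owned and value == 0:
        return "stop-then-relinquish"  # clean stop, then give ticket up
    if not owned and value > 0:
        return "request"               # bid for the ticket at the CTR
    return "none"


def grant_ticket(requesters):
    """CTR side: pick the site to grant to, or None if nobody bids.

    Highest value wins, then highest node count, then lowest nodeid;
    a site with value 0 is never granted the ticket.
    """
    bidders = [s for s in requesters if s.value > 0]
    if not bidders:
        return None
    return min(bidders, key=lambda s: (-s.value, -s.node_count, s.nodeid))
```

Fail-back would then amount to comparing the current owner against `grant_ticket()` over all sites and either acting on a mismatch or just reporting it to the admin.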
>
> (Tangent - ownership appears to belong to the status section; the value
> seems to belong to the cib->ticket section(?).)

Plausible - since you'd not want nodes to come up and think they have tickets.
That would also negate my concern about including the expiration time
in the ticket.

>
> The value can be set manually - in that case, it allows the admin to
> define a primary site for a given set of resources. (It might also be
> modified automatically at a later stage based on whatever metric.)
>
> If a site owns a ticket, but doesn't have the highest value, it would
> either fail-back automatically - or require manual intervention, which
> I'd assume to be quite common.  (Again, this builds a very simplistic
> active/passive overlay.)
>
> Does that make sense, or am I creating more confusion than answers? ;-)
>
>
> Regards,
>    Lars
>
> --
> Architect Storage/HA, OPS Engineering, Novell, Inc.
> SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde