[Pacemaker] Multi-site support in pacemaker (tokens, deadman, CTR)

Wed Apr 27 14:55:03 EDT 2011

On 2011-04-26T23:34:16, Yan Gao <ygao at novell.com> wrote:

Hi Yan,

thanks for the good questions, let's get a discussion started!

> >Introduction: At LPC 2010, we discussed (once more) that a key feature
> >for pacemaker in 2011 would be improved support for multi-site clusters;
> >by multi-site, we mean two (or more) sites with a local cluster each,
> Would the topology of such a multi-site deployment be indicated in
> cib configuration?

So, the CIB would have the resource definitions, their dependencies,
token state (just like node attribute/value pairs) - as the PE needs
this to decide what to start/stop/fence locally.

It's an excellent question where the configuration of the Cluster Token
Registry would reside; I'd assume that there would be a
resource/primitive/clone (design not finished) that corresponds to the
daemon instance, but that some configuration files would need to be
external - among other things, the SSL certificates for its TCP/SCTP
connections to other sites.

So the topology itself would not be exposed in the CIB.

> Or it's just something corosync would need to care about?

No - this is unrelated to corosync. As far as we - or I, at least -
plan this so far, each site would essentially be an independent local
cluster.

The daemons to be written would basically form an overlay network on top
via a non-latency-sensitive protocol such as TCP/SCTP. They would grant or
revoke permission to run specific services by granting or revoking the
token in the CIB; the PE would then perform the fencing/stop/start actions.

Perhaps choosing the name "token" for the cluster-wide attributes was not
a wise move, as it does invoke the "token" association from
corosync/totem.

What do you all think about switching this word to "ticket"? And have
the Cluster Ticket Registry manage them? Less confusion later on, I
think.

I'll try the word "ticket" for the rest of the mail and we can see how
that works out ;-)

(I think the word works - you can own a ticket, grant a ticket, cancel,
and revoke tickets ...)

> And the cibs between different sites would still be synchronized?

The idea is that there would be - perhaps as part of the CTR daemon - a
process that would replicate (manually triggered, periodically, or
automatically) the configuration details of resources associated with a
given ticket (which are easily determined since they depend on it) to
the other sites that are eligible for the ticket.
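
To illustrate "easily determined": a minimal Python sketch that collects the resources depending on a given ticket from a CIB-like XML fragment. The `rsc_ticket` element and its `rsc`/`ticket` attributes are assumptions for illustration only, modelled on the constraint forms discussed in this thread; the design is not finished and this is not existing syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical constraints fragment; element and attribute names are
# assumptions, not a finished CIB schema.
CIB_FRAGMENT = """
<constraints>
  <rsc_ticket id="rscX-with-ticketA" rsc="rscX" ticket="ticketA"/>
  <rsc_ticket id="rscY-with-ticketA" rsc="rscY" ticket="ticketA"/>
  <rsc_ticket id="rscZ-with-ticketB" rsc="rscZ" ticket="ticketB"/>
</constraints>
"""


def resources_for_ticket(cib_xml, ticket):
    """Return the sorted ids of resources that depend on the given ticket."""
    root = ET.fromstring(cib_xml)
    return sorted(
        constraint.get("rsc")
        for constraint in root.iter("rsc_ticket")
        if constraint.get("ticket") == ticket
    )


# The resources whose definitions would be replicated along with ticketA.
print(resources_for_ticket(CIB_FRAGMENT, "ticketA"))
```

A "replicate now" script could start from such a list and push only those resource definitions to the other eligible sites.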

Initially, I'd be quite happy if there was a "replicate now" button to
push or script to call - admins may actually have good reasons not to
immediately replicate everywhere, anyway.

It's conceivable that there would need to be some mangling as
configuration is replicated; e.g., path names and IP addresses may be
different. We _could_ express this using our CIB syntax already
(instance attribute sets take rules, and it'd probably be easy enough
to extend this matching to select on ticket ownership), and perhaps that
is good enough, since I'd imagine there would actually be quite little
to modify.

(Having many differences would make the configuration very complex to
manage and understand; hence, we want a syntax that makes it easy to
have a few different values, and annoying to have many ;-)

> In other words, normally there would be only one DC among the sites,
> right?

Each site would still have its own "DC" at the Pacemaker level. However,
yes, the overlay cluster is - at least for now - active/passive, so only
one site "owns" a given ticket at any given time.

That doesn't mean that other sites are completely dormant, though - they
could own other tickets (sites may own more than one), or run services
that are associated with _not_ owning the ticket; e.g., the ticket owner
runs the DRBD master, the site without it runs a DRBD replica.

> >"Tokens" are, essentially, cluster-wide attributes (similar to node
> >attributes, just for the whole partition).
> Specifically, a "<tokens>" section with an attribute set (
> "<token_set>" or something) under "/cib/configuration"?

Yes; a ticket section, just like that.

> Should an admin grant a token to the cluster initially?

In the "easiest" case, the agent that grants or cancels a ticket would
indeed be the admin, yes.

> Or grant it to several nodes which are supposed to be from a same
> site?

The ticket is cluster-wide, and a corosync/Pacemaker cluster is one
site, so the admin cannot grant a ticket to a subset of nodes.

If the admin doesn't want some nodes to run the related resources, they
can use rsc_location rules to set that as before.

> Or grant it to a partition after a split-brain happens --  A
> split-brain can happen between the sites or inside a site. How could
> it be distinguished and what policies to handle the scenarios
> respectively? What if a partition split further?

Site-split is handled and arbitrated either manually - the admin revokes
the ticket from the losing site, waits until all services are stopped,
and grants it to the winner - or automatically (same process, but
determined on site majority by the CTR).

Site-internal partitioning is handled at exactly that level; only the
winning/quorate partition will be running the CTR daemon and
re-establish communication with the other CTR instances. It will fence
the losers.

Probably it makes sense to add a layer of protection here to the CTR,
though - if several partitions from the same site connect (which could,
conceivably, happen), the CTRs will grant the ticket(s) only to the
partition with the highest node count (or, should these be equal,
lowest nodeid), and immediately revoke it from any other partition.
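
A minimal sketch of that tie-break in Python. The `Partition` type is hypothetical, and I am assuming here that "lowest nodeid" means the lowest node id among the members of each partition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    lowest_nodeid: int  # assumption: lowest node id among the members
    node_count: int


def pick_partition(partitions):
    """Highest node count wins; equal counts fall back to lowest nodeid."""
    return min(partitions, key=lambda p: (-p.node_count, p.lowest_nodeid))


connected = [Partition(3, 2), Partition(1, 2), Partition(5, 1)]
winner = pick_partition(connected)
# Every other partition from the same site would have the ticket revoked.
losers = [p for p in connected if p != winner]
print(winner, losers)
```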

> Additionally, when a split-brain happens, how about the existing
> stonith mechanism. Should the partition without quorum be stonithed?

Yes, just as before.

> If shouldn't, or if couldn't, should the partition elect a DC? What
> about the no-quorum-policy?

All just as before.

> >Via dependencies (similar to
> >rsc_location), one can specify that certain resources require a specific
> >token to be set before being started
> Which way do you prefer? I found you discussed this in another
> thread last year. The choices mentioned there as:
> - A "<rsc_order>" with "Deadman" order-type specified:
>   <rsc_order id="order-tokenA-rscX" first-token="tokenA" then="rscX"
> kind="Deadman"/>
> - A "<rsc_colocation>":
>   <rsc_colocation id="rscX-with-tokenA" rsc="rscX"
> with-token="tokenA" kind="Deadman"/>

These probably would make sense, but are not the primary focus. I can see
the 'deadman' switch to be added to all constraints (to define the
behaviour should the target no longer be in the desired state), but the
ticket is essentially a cluster-wide equivalent of a node attribute;
hence, reusing the order or colocation rules doesn't seem fitting.

> - There could be a "requires" field in an "op", which could be set
> to "quorum" or "fencing". Similarly, we could also introduce a
> "requires-token" field:
> 
> <op id="rscX-start" name="start" interval="0" requires-token="tokenA"/>
> 
> The shortcoming is a resource cannot depend on multiple tokens.

I don't think this works.

> - A completely new type of constraint:
>   <rsc_token id="rscX-with-tokenA" rsc="rscX" token="tokenA"
> kind="Deadman"/>

Personally, I lean towards this. (Andrew has expressed a wish to do
without the "rsc_" prefix, so let's drop this ;-)

Not sure the kind="Deadman" is actually required, but it probably makes
sense to be able to switch off the big hammer for debugging purposes.
;-)

I don't see why any resource would depend on several tickets; but I can
see a use case for wanting to depend on _not_ owning a ticket, similar
to the node attributes. And the resource would need a role, obviously.

Andrew, Yan - do you think we should allow _values_ for tickets, or
should they be strictly defined/undefined/set/unset?

> >The token thus would be similar to a "site quorum"; i.e., the permission
> >to manage/own resources associated with that site, which would be
> >recorded in a rsc dependency. (It'd probably make a lot of sense if this
> >would support resource sets,
> If so, the "op" and the current "rsc_location" are not preferred.
> >so one can easily list all the resources;
> >also, some resources like m/s may tie their role to token ownership.)

Right.

> >Another aspect to site fail-over is recovery speed. A site can only
> >activate the resources safely if it can be sure that the other site has
> >deactivated them. Waiting for them to shutdown "cleanly" could incur
> >very high latency (think "cascaded stop delays"). So, it would be
> >desirable if this could be short-circuited. The idea between Andrew and
> >myself was to introduce the concept of a "dead man" dependency; if the
> >origin goes away, nodes which host dependent resources are fenced,
> >immensely speeding up recovery.
> Does the "origin" mean "token"?

In the context of deadman dependencies on tickets, yes, the ticket would
be the "origin". Andrew and I had discussed the "deadman" concept as a
more general form of escalating the stop sequence, not just for tickets.

> If so, isn't it supposed to be revoked manually by default? So the
> short-circuited fail-over needs an admin to participate?

No to both; it can be revoked manually, yes, but that isn't always going
to be the case. I'm also not quite sure I understand where this
question is headed; how does it matter here whether the ticket is
revoked manually or not?

> BTW, Xinwei once suggested to treat "the token is not set" and "the
> token is set to no" differently. For the former, the behavior would
> be like the token dependencies don't exist. If the token is
> explicitly set, invoke the appropriate policies. Does that help to
> distinguish scenarios?

The ticket not being set/defined should be identical to the ticket being
set to "false/no", as far as I can see - in either case, the ticket is
not owned, so all resources associated with it _must_ be stopped, and
may not be started again.
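
In code terms, a minimal sketch of that equivalence, assuming ticket state is held in a plain dictionary of booleans (a stand-in for illustration, not a real CIB interface):

```python
def ticket_owned(tickets, name):
    """A ticket counts as owned only if it is explicitly set to True.

    A missing entry and an explicit False are deliberately treated the same.
    """
    return tickets.get(name) is True


print(ticket_owned({}, "ticketA"))                  # ticket not defined
print(ticket_owned({"ticketA": False}, "ticketA"))  # ticket set to false
print(ticket_owned({"ticketA": True}, "ticketA"))   # ticket granted
```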

> Does it mean an option for users to choose if they want an
> immediate fencing or stopping the resources normally? Is it global
> or particularly for a specific token, or even/just for a specific
> dependency?

Good question. This came up above already briefly ...

I _think_ there should be a special value that a ticket can be set to
that doesn't fence, but stops everything cleanly.

However, while the ticket is in this state, the site _still_ owns it (no
other site can get it yet, and were it to lose the ticket due to
expiration, it'd still need to fence all remaining nodes so that the
services can be started elsewhere). 

Perhaps the CTR doesn't even need to know about this - it's a special
setting of the ticket at a given site. Perhaps it makes sense to
distinguish between owning the ticket (as granted on request via the CTR
or manually), and its value (which is set locally)? Perhaps:

Ownership is a true/false flag. Value is a non-negative integer (that
is, 0 is allowed).

A site that "owns" a ticket of value 0 will stop resources cleanly, and
afterwards relinquish the ticket itself.

A site that "owns" a ticket of any value and loses it will perform the
deadman dance.

A site that does not own a ticket but has a non-zero value for it
defined will request the ticket from the CTR; the CTR will grant it to
the site with the highest bid (but not to a site with 0) (if these are
equal, to the site with the highest node count, if these again are
equal, to the site with the lowest nodeid).
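
Seen from a single site, the three rules above could be sketched like this in Python. The `Action` names and the `lost` flag (meaning the ticket was revoked or expired while the site held it) are hypothetical; this is only an illustration of the idea, not a design.

```python
from enum import Enum


class Action(Enum):
    NONE = "nothing to do"
    STOP_AND_RELINQUISH = "stop resources cleanly, then give up the ticket"
    DEADMAN = "fence nodes hosting dependent resources"
    REQUEST = "request the ticket from the CTR"


def site_action(owns, value, lost=False):
    """Decide what a site does for one ticket.

    owns:  whether the site currently holds the ticket
    value: the locally configured, non-negative ticket value
    lost:  the ticket was taken away (revoked/expired) while held
    """
    if lost:
        # Losing a held ticket triggers the deadman dance, whatever the value.
        return Action.DEADMAN
    if owns and value == 0:
        return Action.STOP_AND_RELINQUISH
    if not owns and value > 0:
        return Action.REQUEST
    return Action.NONE


print(site_action(owns=True, value=0))
print(site_action(owns=True, value=5, lost=True))
print(site_action(owns=False, value=3))
```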

(Tangent - ownership appears to belong to the status section; the value
seems to belong to the cib->ticket section(?).)

The value can be set manually - in that case, it allows the admin to
define a primary site for a given set of resources. (It might also be
modified automatically at a later stage based on whatever metric.)

If a site owns a ticket, but doesn't have the highest value, it would
either fail-back automatically - or require manual intervention, which
I'd assume to be quite common.  (Again, this builds a very simplistic
active/passive overlay.)
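
The bidding rule and the fail-back condition can be sketched together. This is a minimal Python illustration; the `Bid` type and both function names are hypothetical, and "lowest nodeid" is again assumed to be the lowest node id within each site.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    site: str
    value: int          # locally configured ticket value; 0 means "no bid"
    node_count: int
    lowest_nodeid: int  # assumption: lowest node id within the site


def grant_to(bids):
    """Highest value wins; ties go to node count, then to lowest nodeid."""
    eligible = [b for b in bids if b.value > 0]
    if not eligible:
        return None
    return min(
        eligible,
        key=lambda b: (-b.value, -b.node_count, b.lowest_nodeid),
    )


def failback_candidate(owner, bids):
    """Return the site that outbids the current owner, if there is one."""
    best = grant_to(bids)
    if best is not None and best.value > owner.value:
        return best
    return None


bids = [Bid("siteA", 10, 3, 1), Bid("siteB", 10, 4, 7), Bid("siteC", 0, 9, 2)]
print(grant_to(bids).site)
print(failback_candidate(Bid("siteA", 5, 3, 1), bids))
```

Whether a fail-back candidate leads to an automatic move or merely to a hint for the admin would be a policy decision on top of this.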

Does that make sense, or am I creating more confusion than answers? ;-)

Regards,
    Lars

-- 
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde