[Pacemaker] Multi-site support in pacemaker (tokens, deadman, CTR)

Lars Marowsky-Bree lmb at novell.com
Thu Jan 13 04:14:09 EST 2011


Hi all,

sorry for the delay in posting this.

Introduction: At LPC 2010, we discussed (once more) that a key feature
for Pacemaker in 2011 would be improved support for multi-site clusters;
by multi-site, we mean two (or more) sites with a local cluster each,
and some higher-level entity coordinating fail-over across them (as
opposed to "stretched" clusters, where a single cluster might span a
whole campus within a city).

Typically, such multi-site environments are also too far apart to
support synchronous communication/replication.

There are several aspects to this that we discussed; Andrew and I first
described and wrote this out a few years ago, so I hope he can remember
the rest ;-)

"Tokens" are, essentially, cluster-wide attributes (similar to node
attributes, just for the whole partition). Via dependencies (similar to
rsc_location), one can specify that certain resources require a specific
token to be set before being started (and, vice versa, need to be
stopped if the token is cleared). You could also think of our current
"quorum" as a special, cluster-wide token that is granted in case of
node majority.
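
To make this a bit more concrete, here is a rough sketch of how a
token and a token dependency might look in CIB XML. None of this
syntax exists yet; the element and attribute names ("tokens", "token",
"rsc_token") are purely illustrative:

    <tokens>
      <!-- cluster-wide attribute recording whether this partition
           currently holds the token for site A -->
      <token id="siteA-token" granted="true"/>
    </tokens>
    <constraints>
      <!-- analogous to rsc_location: "web-ip" may only run while
           siteA-token is granted, and must stop when it is revoked -->
      <rsc_token id="web-ip-needs-siteA" rsc="web-ip"
                 token="siteA-token"/>
    </constraints>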

The token thus would be similar to a "site quorum", i.e. the permission
to manage/own the resources associated with that site, which would be
recorded in an rsc dependency. (It would probably make a lot of sense
for this to support resource sets, so one can easily list all the
affected resources; also, some resources such as master/slave may tie
their role to token ownership.)
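
Purely as an illustration again (assuming the same made-up rsc_token
element), a resource set plus a master role tied to the token could
look like:

    <rsc_token id="siteA-resources" token="siteA-token">
      <!-- every resource in the first set depends on the token; the
           m/s resource additionally ties its Master role to it -->
      <resource_set id="siteA-set">
        <resource_ref id="web-ip"/>
        <resource_ref id="web-server"/>
      </resource_set>
      <resource_set id="siteA-masters" role="Master">
        <resource_ref id="ms-drbd-data"/>
      </resource_set>
    </rsc_token>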

These tokens can be granted/revoked either manually (which I actually
expect will be the default for the classic enterprise clusters), or via
an automated mechanism described further below.


Another aspect to site fail-over is recovery speed. A site can only
activate the resources safely if it can be sure that the other site has
deactivated them. Waiting for them to shut down "cleanly" could incur
very high latency (think "cascaded stop delays"), so it would be
desirable if this could be short-circuited. The idea Andrew and I had
was to introduce the concept of a "dead man" dependency: if the
origin goes away, nodes which host dependent resources are fenced,
immensely speeding up recovery.

It probably makes the most sense to make this an attribute of some sort
on the various dependencies we already have, so that it becomes
generally available. (It may also be something admins want to disable
temporarily - e.g., for a graceful switch-over, they may not always
want to trigger the dead man process.)
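
As a sketch of that attribute idea (again with made-up names), the
dead man behaviour could be a per-dependency flag that the admin flips
off before a planned switch-over:

    <!-- on token loss, fence the nodes hosting the dependents instead
         of waiting for an orderly cascaded stop -->
    <rsc_token id="siteA-resources" token="siteA-token"
               loss-policy="fence"/>

    <!-- temporarily switched to an orderly stop for a graceful,
         admin-driven switch-over -->
    <rsc_token id="siteA-resources" token="siteA-token"
               loss-policy="stop"/>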


The next bit is what we called the "Cluster Token Registry"; for those
scenarios where the site switch is supposed to be automatic (instead of
the admin revoking the token somewhere, waiting for everything to stop,
and then granting it on the desired site). The participating clusters
would each run a daemon/service; these connect to each other, exchange
information on their connectivity (where conceivably not just mere
majority is relevant, but also current ownership, admin-assigned
weights, time of day, capacity, ...), and vote on which site gets
which token(s). A token would only be granted to a site once it can be
sure that the token has been relinquished by the previous owner, which
would need to be implemented via a timer in most scenarios (see the
dead man flag above).

Further, sites which lose the vote (either explicitly or implicitly by
being disconnected from the voting body) would obviously need to perform
said release after a sane time-out (to protect against brief connection
issues).
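
To illustrate what such a CTR configuration might need to contain
(format and names entirely hypothetical), each site's daemon would
have to know its peers, the tokens, and the relevant timers:

    <ctr id="ctr-config">
      <site id="siteA" addr="10.0.1.10"/>
      <site id="siteB" addr="10.0.2.10"/>
      <!-- optional third party that only arbitrates votes -->
      <site id="arbitrator" addr="10.0.3.10" vote-only="true"/>
      <token id="siteA-token">
        <!-- a losing/disconnected site must drop the token within
             this time (dead man fencing makes that fast) ... -->
        <release-timeout>60s</release-timeout>
        <!-- ... and the winner is only granted the token after
             waiting at least that long, so the two sites can never
             hold it simultaneously -->
        <grant-delay>90s</grant-delay>
      </token>
    </ctr>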


A final component is an idea to ease administration and management of
such environments. The dependencies allow an automated tool to identify
which resources are affected by a given token; their configuration
could then be automatically replicated (and possibly transformed)
between sites, to ensure that all sites have an up-to-date
configuration of the relevant resources. This would be handled by yet
another extension, a CIB
replicator service (that would either run permanently or explicitly when
the admin calls it).
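
The token dependencies would give such a replicator an obvious handle
on what to copy: roughly, the constraint plus every resource it
references, pushed as a fragment into the peer site's CIB (the
rsc_token element being as hypothetical as above):

    <!-- fragment the replicator would keep in sync across sites -->
    <constraints>
      <rsc_token id="siteA-resources" token="siteA-token">
        <resource_set id="siteA-set">
          <resource_ref id="web-ip"/>
        </resource_set>
      </rsc_token>
    </constraints>
    <resources>
      <primitive id="web-ip" class="ocf" provider="heartbeat"
                 type="IPaddr2">
        <instance_attributes id="web-ip-attrs">
          <nvpair id="web-ip-addr" name="ip" value="192.168.100.10"/>
        </instance_attributes>
      </primitive>
    </resources>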

Conceivably, the "inactive" resources may not even be present in the
active CIB of sites which do not own the token (and would only be
inserted once token ownership is established). This may be an
interesting (optional) feature to keep CIB sizes under control.


Andrew, is that about what we discussed? Any comments from anyone else?
Did I capture what we spoke about at LPC?


Regards,
    Lars

-- 
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
