[ClusterLabs] Position of pacemaker in today's HA world

Fri Oct 5 07:47:37 EDT 2018

Hello HA enthusiasts,

I've come across an interesting article on how high availability
historically evolved from the perspective of database engines
(I couldn't witness this first hand since I don't have a time
machine, but some of you can perhaps comment on whether the picture
matches your own experience).  In part, it may be a promo for
a particular product, but this is in no way an attempt to endorse
it -- the text is informative on its own merit:

<https://www.cockroachlabs.com/blog/brief-history-high-availability/>

It describes how the first, easy step towards "ideal HA" was
an active-passive setup, with stateful resources (like DBs)
first using synchronous replication of the state, hence their
overall availability relying on the backup being functional,
and later asynchronous replication, which allows for losing bits.
(Note that any non-trivial application will always require some
notion of rather persistent state -- as mentioned several times
in this venue, stateless services do not need to bother with all
the "HA coordination" burden since there are typically light-weight
alternatives for "bring services up in this order" type of tasks,
hence I explicitly exclude them from further discussion).

Then it talks about "sharding" (I must admit I hadn't heard this
term before): splitting a two-node active-passive monolith into
multiple active-passive pairs, using some domain-specific cuts
(like primary key ranges for tables in a DB), plus some kind of
gateway in front of them that routes the requests to the
corresponding pair.
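
To make the routing idea concrete, here is a minimal sketch in
Python of such a gateway, assuming integer primary keys split into
contiguous ranges.  The class names, node names and range bounds are
all made up for illustration; they come neither from the article nor
from any real product.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """One active-passive pair; both node names are hypothetical."""
    active: str
    passive: str


class ShardGateway:
    """Routes a primary key to the pair owning its key range.

    upper_bounds[i] is the exclusive upper bound of the keys served
    by pairs[i]; the last pair serves everything above the last bound.
    """

    def __init__(self, upper_bounds, pairs):
        if len(pairs) != len(upper_bounds) + 1:
            raise ValueError("need exactly one more pair than bounds")
        if list(upper_bounds) != sorted(upper_bounds):
            raise ValueError("bounds must be sorted in ascending order")
        self.upper_bounds = list(upper_bounds)
        self.pairs = list(pairs)

    def route(self, key):
        # bisect_right counts the bounds that are <= key, which is
        # exactly the index of the pair owning that key.
        return self.pairs[bisect.bisect_right(self.upper_bounds, key)]


gateway = ShardGateway(
    upper_bounds=[1000, 2000],
    pairs=[Pair("db1a", "db1b"), Pair("db2a", "db2b"), Pair("db3a", "db3b")],
)
```

With these assumed bounds, gateway.route(5) picks the first pair and
gateway.route(1000) the second one.  Note that the gateway itself
becomes a component that needs to be made highly available.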

Finally, the evolution brought us to active-active setups, which
typically solve the consistency issues amongst partly independent
nodes with after-the-fact conflict reconciliation.  An alternative
to this is a before-the-fact consensus negotiation on what the
next "true" shared state will be -- the article calls this
arrangement multi-active, and apparently, it means that the main
mechanisms of the corosync-pacemaker stack, membership and
consensus, are duplicated privately at this resource level.
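
The difference between the two approaches can be sketched roughly as
follows.  This is a deliberately naive illustration in Python, not
the algorithm of any particular database nor of corosync: the
reconciliation rule is assumed to be "last writer wins" on
a hypothetical logical timestamp, and the consensus is reduced to
a plain strict-majority check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    timestamp: int  # hypothetical logical clock value


def reconcile_after_the_fact(a, b):
    """Active-active: both nodes already accepted their write, and
    the conflict is resolved later (here: the last writer wins)."""
    return a if a.timestamp >= b.timestamp else b


def accept_before_the_fact(acks, cluster_size):
    """Multi-active: a write becomes the shared state only once
    a strict majority of the members has acknowledged it."""
    return acks > cluster_size // 2
```

For example, accept_before_the_fact(2, 3) holds while
accept_before_the_fact(2, 4) does not, since two out of four nodes
do not form a strict majority.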

* * *

This brings me to what I want to discuss -- the relevance of
corosync-pacemaker clusters in light of increasingly common
resource-level "private" clustering (amongst other trends like
a push towards containerization), and how to perhaps rearticulate
their mission to stay relevant for years to come.

I see pacemaker's biggest value currently in:

* HA-fying plain non-distributed services, either as active-passive
  or even active-active provided that the "shared state" problem is
  either non-existent or off-loaded elsewhere -- distributed file
  system/storage, distributed DB, etc.

* helping in the "last mile" for multiple-actors-ready active-passive
  services (matches multi-role resource agent arrangement)

* multisite/cluster-of-clusters handling in combination with booth

and their (almost) arbitrarily complex combinations, all while
achieving proper sanity through node-level isolation should
HA-damaging failures occur.

On the other hand, with standalone self-clustering resources
(at best, they could reuse the facilities of corosync alone for
their function), perhaps the only value added would be this
"isolation" part, but then, stonith-ng/pacemaker-fenced together
with a static configuration file would be all that's needed so that
such a resource can hook into it.  Note that the "sharding
gateway/router", conflict reconciliation and perhaps even consensus
negotiation all appear to be highly application specific.  To be
relevant in those contexts, the opposite of "external wrapping"
would be needed -- making the framework complete, offering a
library/API so that the applications are built on top of it
natively.  An example of this that I came across while doing some
brief research a while ago is <https://github.com/NetComposer/nkcluster>
(on that note, Erlang was designed specifically with fault-tolerant,
resilient and distributed applications in mind, making me wonder
if it was ever considered originally for what later became pacemaker).

Also, one of the fields where pacemaker used to be very helpful was
concise startup/shutdown ordering amongst multiple on-node
services.  This is partially obviated by smart init managers, most
notably systemd on Linux, which play in a whole different league
than the old, inflexible init systems that were around when the
foundation of pacemaker was laid.
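
As an illustration of what such ordering boils down to, here is
a small Python sketch that derives startup and shutdown sequences
from "start after" constraints using the standard graphlib module
(Python 3.9+).  The service names are hypothetical, and this is not
meant to reflect how pacemaker or systemd implement ordering
internally.

```python
from graphlib import TopologicalSorter

# Hypothetical on-node services; each one maps to the set of
# services it must be started after.
starts_after = {
    "database": {"storage"},
    "app-server": {"database", "network"},
    "web-frontend": {"app-server"},
}


def startup_order(deps):
    """Return one valid start sequence honoring all constraints."""
    return list(TopologicalSorter(deps).static_order())


def shutdown_order(deps):
    """Shutting down is simply starting up in reverse."""
    return list(reversed(startup_order(deps)))
```

Independent services (here "storage" and "network") may come out in
either relative order; only the declared constraints are guaranteed.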

* * *

Please don't take this as blasphemy; I am just trying to pull my
head out of the tunnel (or sand, if you want) to view the value of
the corosync-pacemaker stack in the IT infrastructures of today and
the future, and to gather feedback on this topic, perhaps together
with ideas on how to indeed stay relevant amid the proliferation of
"private clustering", management and orchestration of resources
that we have observed in the past years (which makes the landscape
slightly different than it was when heartbeat [and Red Hat Cluster
Suite] was the thing).

Please share your thoughts with me/us, even if they are not
the most encouraging things to hear, since
- staying realistic is important
- staying relevant is what prevents becoming a fossil tomorrow
:-)

Happy World Teachers' Day.

-- 
Jan (Poki)