[ClusterLabs] Introducing the Anvil! Intelligent Availability platform

Wed Jul 5 02:52:26 EDT 2017

Hi all,

  I suspect by now, many of you here have heard me talk about the Anvil!
intelligent availability platform. Today, I am proud to announce that it
is ready for general use!

https://github.com/ClusterLabs/striker/releases/tag/v2.0.0

  I started five years ago with an idea of building an "Availability
Appliance". A single machine where any part could be failed, removed and
replaced without needing a maintenance window. A system with no single
point of failure anywhere wrapped behind a very simple interface.

  The underlying architecture that provides this redundancy was laid
down years ago as an early tutorial and has been field tested all over
North America and around the world in the years since. In that time, the
Anvil! platform has demonstrated over 99.9999% availability!

  Starting back then, the goal was to write the web interface that made
it easy to use the Anvil! platform. Then, about two years ago, I decided
that an Anvil! could be much, much more than just an appliance.

  It could think for itself.

  Today, I would like to announce version 2.0.0. This release
introduces the ScanCore "decision engine". ScanCore can be thought of as
a sort of "Layer 3" availability platform. Where Corosync provides
membership and communications, with Pacemaker (and rgmanager) sitting on
top monitoring applications and handling fault detection and recovery,
ScanCore sits on top of both, gathering disparate data, analyzing it and
making "big picture" decisions on how to best protect the hosted servers.

  Examples:

1. All servers are on node 1, and node 1 suffers a cooling fan failure.
ScanCore compares against node 2's health, waits a period of time in
case it is a transient fault and then autonomously live-migrates the
servers to node 2. Later, node 2 suffers a drive failure, degrading the
underlying RAID array. ScanCore can then compare the relative risks of a
failed fan versus a degraded RAID array, determine that the failed fan
is less risky and automatically migrate the servers back to node 1. If a
hot-spare kicks in and the array returns to an Optimal state, ScanCore
will again migrate the servers back to node 2. When node 1's fan failure
is finally repaired, the servers stay on node 2 because both nodes are
now equally healthy and there is no benefit to migrating.

2. Input power is lost to one UPS, but not the second UPS. ScanCore
knows that good power is available and, so, doesn't react in any way. If
input power is lost to both UPSes, however, then ScanCore will decide
that the greatest risk to server availability is no longer unexpected
component failure, but instead depleting the batteries. Given this, it
will decide that the best option to protect the hosted servers is to
shed load and maximize run time. If the power stays out for too long,
then ScanCore will determine hard off is imminent, and decide to
gracefully shut down all hosted servers, withdraw and power off. Later,
when power returns, the Striker dashboards will monitor the charge rate
of the UPSes and as soon as it is safe to do so, restart the nodes and
restore full redundancy.

3. Similar to case 2, ScanCore can gather temperature data from multiple
sources and use this data to distinguish localized cooling failures from
environmental cooling failures, like the loss of an HVAC or AC system.
In the former case, ScanCore will migrate servers off and, if critical
temperatures are reached, shut down systems before hardware damage can
occur. In the latter case, ScanCore will decide that minimizing thermal
output is the best way to protect hosted servers and, so, will shed load
to accomplish this. If necessary to avoid damage, ScanCore will perform
a full shut down. Once ScanCore (on the low-powered Striker dashboards)
determines thermal levels are safe again, it will restart the nodes and
restore full redundancy.
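
  To make the reasoning in examples 1 and 2 a little more concrete,
here is a minimal sketch in Python. This is not ScanCore's actual code;
the names (Node, RISK_WEIGHTS, preferred_host, power_action), the
numeric weights and the 10-minute shutdown threshold are all
assumptions made up for illustration. The only things taken from the
examples above are that a failed fan is treated as less risky than a
degraded RAID array, that servers do not move between equally healthy
nodes, and that losing input power to both UPSes leads to load
shedding and, eventually, a graceful shutdown. The wait for transient
faults is left out to keep it short.

```python
from dataclasses import dataclass, field

# Hypothetical risk weights. The only ordering taken from the text is
# that a failed fan is less risky than a degraded RAID array.
RISK_WEIGHTS = {"fan_failed": 1, "raid_degraded": 2}


@dataclass
class Node:
    name: str
    faults: set = field(default_factory=set)

    def risk(self) -> int:
        # Sum the weights of every fault currently present on this node.
        return sum(RISK_WEIGHTS[fault] for fault in self.faults)


def preferred_host(current: Node, peer: Node) -> Node:
    """Return the node that should host the servers.

    Servers move only when the peer is strictly healthier, so two
    equally healthy nodes never trigger a migration.
    """
    return peer if peer.risk() < current.risk() else current


def power_action(ups_on_mains, minutes_left, shutdown_below=10.0):
    """Pick a reaction to the UPS state (thresholds are assumptions).

    ups_on_mains: one boolean per UPS, True if it still has input power.
    minutes_left: estimated battery run time remaining.
    """
    if any(ups_on_mains):
        return "no_action"          # good power is still available
    if minutes_left <= shutdown_below:
        return "graceful_shutdown"  # hard off is imminent
    return "shed_load"              # maximize run time on batteries


if __name__ == "__main__":
    node1 = Node("node1", {"fan_failed"})
    node2 = Node("node2")
    # Walk through example 1, step by step.
    print(preferred_host(node1, node2).name)   # fan fails on node 1
    node2.faults.add("raid_degraded")
    print(preferred_host(node2, node1).name)   # RAID degrades on node 2
    node2.faults.discard("raid_degraded")
    print(preferred_host(node1, node2).name)   # hot-spare rebuilds array
    node1.faults.discard("fan_failed")
    print(preferred_host(node2, node1).name)   # fan repaired, no move
```

  Using a strict "less than" comparison is what keeps the servers on
node 2 at the end of example 1; a "less than or equal" test would
bounce them back and forth between two healthy nodes for no benefit.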

  All of this intelligence is of little use, of course, if it is hard to
build and maintain an Anvil! system. Perhaps the greatest lesson learned
from our old tutorial was that the barrier to entry had to be reduced
dramatically.

https://www.alteeve.com/w/Build_an_m2_Anvil!

  So, this release also dramatically simplifies going from bare iron
to provisioned, protected servers. Even with no experience in
availability at all, a tech should be able to go from iron in boxes to
provisioned servers in one or two days. Almost all steps have
been automated, which serves the core goal of maximum reliability by
minimizing the chances for human error.

  This version also introduces the ability to run entirely offline. This
version of the Anvil! is entirely self-contained with internal
repositories making it possible to fully manage an Anvil! with no
external access to the outside world, including rebuilding Striker
dashboards or Anvil! nodes after a major fault and building new Anvil!
node pairs.

  There is so much more that the Anvil! platform can do, but this
announcement is already quite long, so I'll stop here.

  I'm more than happy to answer any questions and, of course, I would
very much love to hear feedback, suggestions, feature requests or
critiques.

  Finally, I want to thank the rest of the team at Alteeve. Without them
keeping the lights on and our customers happy, I would never have been
able to put the time in needed to make this release possible. And, of
course, to all of you for the years of advice, banter and debate. I
still have very much to learn!

  Now, time to start working full time on version 3!

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould