[ClusterLabs] Introducing the Anvil! Intelligent Availability platform

Wed Jul 5 13:55:12 UTC 2017

Wow! I'm looking forward to the September summit talk.

On 07/05/2017 01:52 AM, Digimer wrote:
> Hi all,
> 
>   I suspect by now, many of you here have heard me talk about the Anvil!
> intelligent availability platform. Today, I am proud to announce that it
> is ready for general use!
> 
> https://github.com/ClusterLabs/striker/releases/tag/v2.0.0
> 
>   I started five years ago with an idea of building an "Availability
> Appliance". A single machine where any part could be failed, removed and
> replaced without needing a maintenance window. A system with no single
> point of failure anywhere wrapped behind a very simple interface.
> 
>   The underlying architecture that provides this redundancy was laid
> down years ago as an early tutorial and has been field tested all over
> North America and around the world in the years since. In that time, the
> Anvil! platform has demonstrated over 99.9999% availability!
> 
>   Starting back then, the goal was to write the web interface that made
> it easy to use the Anvil! platform. Then, about two years ago, I decided
> that an Anvil! could be much, much more than just an appliance.
> 
>   It could think for itself.
> 
>   Today, I would like to announce version 2.0.0. This release
> introduces the ScanCore "decision engine". ScanCore can be thought of as
> a sort of "Layer 3" availability platform. Where Corosync provides
> membership and communications, with Pacemaker (and rgmanager) sitting on
> top monitoring applications and handling fault detection and recovery,
> ScanCore sits on top of both, gathering disparate data, analyzing it and
> making "big picture" decisions on how to best protect the hosted servers.
> 
>   Examples:
> 
> 1. All servers are on node 1, and node 1 suffers a cooling fan failure.
> ScanCore compares against node 2's health, waits a period of time in
> case it is a transient fault and then autonomously live-migrates the
> servers to node 2. Later, node 2 suffers a drive failure, degrading the
> underlying RAID array. ScanCore can then compare the relative risks of a
> failed fan versus a degraded RAID array, determine that the failed fan
> is less risky and automatically migrate the servers back to node 1. If a
> hot-spare kicks in and the array returns to an Optimal state, ScanCore
> will again migrate the servers back to node 2. When node 1's fan failure
> is finally repaired, the servers stay on node 2 as there is no benefit
> to migrating as now both nodes are equally healthy.
> 
> 2. Input power is lost to one UPS, but not the second UPS. ScanCore
> knows that good power is available and, so, doesn't react in any way. If
> input power is lost to both UPSes, however, then ScanCore will decide
> that the greatest risk to server availability is no longer unexpected
> component failure, but instead depleting the batteries. Given this, it
> will decide that the best option to protect the hosted servers is to
> shed load and maximize run time. If the power stays out for too long,
> then ScanCore will determine hard off is imminent, and decide to
> gracefully shut down all hosted servers, withdraw and power off. Later,
> when power returns, the Striker dashboards will monitor the charge rate
> of the UPSes and as soon as it is safe to do so, restart the nodes and
> restore full redundancy.
> 
> 3. Similar to case 2, ScanCore can gather temperature data from multiple
> sources and use this data to distinguish localized cooling failures from
> environmental cooling failures, like the loss of an HVAC or AC system.
> In the former case, ScanCore will migrate servers off and, if critical
> temperatures are reached, shut down systems before hardware damage can
> occur. In the latter case, ScanCore will decide that minimizing thermal
> output is the best way to protect hosted servers and, so, will shed load
> to accomplish this. If necessary to avoid damage, ScanCore will perform
> a full shut down. Once ScanCore (on the low-powered Striker dashboards)
> determines thermal levels are safe again, it will restart the nodes and
> restore full redundancy.
> 
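The decision logic in examples 1 and 2 above can be sketched roughly as
follows. This is only an illustration, not ScanCore's actual code: the
fault names, the risk weights, the 10-minute shutdown threshold and the
function names are all assumptions made up for the sketch, and the
"wait in case it is a transient fault" step is left out.

```python
from dataclasses import dataclass, field
from typing import Sequence, Set

# Hypothetical risk weights: a degraded RAID array is treated as riskier
# than a failed fan, as in example 1. The real scoring is not given here.
RISK_WEIGHTS = {"fan_failed": 1, "raid_degraded": 3}


@dataclass
class Node:
    name: str
    faults: Set[str] = field(default_factory=set)

    def risk(self) -> int:
        """Sum the weights of all faults currently active on this node."""
        return sum(RISK_WEIGHTS[fault] for fault in self.faults)


def preferred_host(current: Node, peer: Node) -> Node:
    """Move only if the peer is strictly healthier; ties stay put."""
    return peer if peer.risk() < current.risk() else current


def power_action(ups_has_input: Sequence[bool], runtime_minutes: float,
                 shutdown_below_minutes: float = 10.0) -> str:
    """Pick a response to a power event, following example 2."""
    if any(ups_has_input):
        return "none"        # good power is still available
    if runtime_minutes <= shutdown_below_minutes:
        return "shutdown"    # hard off is imminent
    return "shed_load"       # stretch the remaining battery time


if __name__ == "__main__":
    node1, node2 = Node("node1", {"fan_failed"}), Node("node2")
    print(preferred_host(node1, node2).name)   # servers currently on node1
    node2.faults.add("raid_degraded")
    print(preferred_host(node2, node1).name)   # servers currently on node2
    print(power_action([False, False], runtime_minutes=45))
```

The strict "less than" comparison is what keeps the servers on node 2
once node 1's fan is repaired: equal health gives no reason to migrate.
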
>   All of this intelligence is of little use, of course, if it is hard to
> build and maintain an Anvil! system. Perhaps the greatest lesson learned
> from our old tutorial was that the barrier to entry had to be reduced
> dramatically.
> 
> https://www.alteeve.com/w/Build_an_m2_Anvil!
> 
>   So, this release also dramatically simplifies the path
> from bare iron to provisioned, protected servers. Even with no
> experience in availability at all, a tech should be able to go from iron
> in boxes to provisioned servers in one or two days. Almost all steps have
> been automated, which serves the core goal of maximum reliability by
> minimizing the chances for human error.
> 
>   This version also introduces the ability to run entirely offline. This
> version of the Anvil! is entirely self-contained with internal
> repositories making it possible to fully manage an Anvil! with no
> external access to the outside world, including rebuilding Striker
> dashboards or Anvil! nodes after a major fault and building new Anvil!
> node pairs.
> 
>   There is so much more that the Anvil! platform can do, but this
> announcement is already quite long, so I'll stop here.
> 
>   I'm more than happy to answer any questions and, of course, I would
> very much love to hear feedback, suggestions, feature requests or
> critiques.
> 
>   Finally, I want to thank the rest of the team at Alteeve. Without them
> keeping the lights on and our customers happy, I would never have been
> able to put the time in needed to make this release possible. And, of
> course, to all of you for the years of advice, banter and debate. I
> still have very much to learn!
> 
>   Now, time to start working full time on version 3!
>