[ClusterLabs] Introducing the Anvil! Intelligent Availability platform
ccaulfie at redhat.com
Thu Jul 6 03:29:57 EDT 2017
On 05/07/17 14:55, Ken Gaillot wrote:
> Wow! I'm looking forward to the September summit talk.
Me too! Congratulations on the release :)
> On 07/05/2017 01:52 AM, Digimer wrote:
>> Hi all,
>> I suspect by now, many of you here have heard me talk about the Anvil!
>> intelligent availability platform. Today, I am proud to announce that it
>> is ready for general use!
>> I started five years ago with an idea of building an "Availability
>> Appliance". A single machine where any part could be failed, removed and
>> replaced without needing a maintenance window. A system with no single
>> point of failure anywhere wrapped behind a very simple interface.
>> The underlying architecture that provides this redundancy was laid
>> down years ago as an early tutorial and has been field tested all over
>> North America and around the world in the years since. In that time, the
>> Anvil! platform has demonstrated over 99.9999% availability!
>> Starting back then, the goal was to write the web interface that made
>> it easy to use the Anvil! platform. Then, about two years ago, I decided
>> that an Anvil! could be much, much more than just an appliance.
>> It could think for itself.
>> Today, I would like to announce version 2.0.0. This release
>> introduces the ScanCore "decision engine". ScanCore can be thought of as
>> a sort of "Layer 3" availability platform. Where Corosync provides
>> membership and communications, with Pacemaker (and rgmanager) sitting on
>> top monitoring applications and handling fault detection and recovery,
>> ScanCore sits on top of both, gathering disparate data, analyzing it and
>> making "big picture" decisions on how to best protect the hosted servers.
>> 1. All servers are on node 1, and node 1 suffers a cooling fan failure.
>> ScanCore compares against node 2's health, waits a period of time in
>> case it is a transient fault and then autonomously live-migrates the
>> servers to node 2. Later, node 2 suffers a drive failure, degrading the
>> underlying RAID array. ScanCore can then compare the relative risks of a
>> failed fan versus a degraded RAID array, determine that the failed fan
>> is less risky and automatically migrate the servers back to node 1. If a
>> hot-spare kicks in and the array returns to an Optimal state, ScanCore
>> will again migrate the servers back to node 2. When node 1's fan failure
>> is finally repaired, the servers stay on node 2 as there is no benefit
>> to migrating as now both nodes are equally healthy.
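
The comparison in case 1 can be sketched roughly as below. This is only an illustration of the described behaviour, not ScanCore's actual code: the `Node` class, the `RISK` weights and `choose_host` are all hypothetical, and the "wait in case the fault is transient" delay is left out for brevity.

```python
from dataclasses import dataclass, field

# Hypothetical risk weights. The announcement only states that a failed fan
# is judged less risky than a degraded RAID array; the numbers are invented.
RISK = {"fan_failure": 1, "raid_degraded": 2}


@dataclass
class Node:
    name: str
    faults: set = field(default_factory=set)

    def risk(self) -> int:
        # Total risk is the sum of the weights of all active faults.
        return sum(RISK[f] for f in self.faults)


def choose_host(current: Node, peer: Node) -> Node:
    """Migrate only if the peer is strictly healthier; stay put on a tie."""
    return peer if peer.risk() < current.risk() else current


# Walk through the sequence of events from case 1.
node1, node2 = Node("node1"), Node("node2")
host = node1
node1.faults.add("fan_failure")
host = choose_host(host, node2)        # servers move to node 2
node2.faults.add("raid_degraded")
host = choose_host(host, node1)        # fan is the lesser risk: back to node 1
node2.faults.discard("raid_degraded")  # hot-spare restores the array
host = choose_host(host, node2)        # node 2 is healthy again: move to node 2
node1.faults.discard("fan_failure")
host = choose_host(host, node1)        # equal health: no migration
print(host.name)
```

The strict "less than" comparison is what keeps the servers on node 2 at the end: with both nodes equally healthy there is nothing to gain from another migration.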
>> 2. Input power is lost to one UPS, but not the second UPS. ScanCore
>> knows that good power is available and, so, doesn't react in any way. If
>> input power is lost to both UPSes, however, then ScanCore will decide
>> that the greatest risk to server availability is no longer unexpected
>> component failure, but instead depleting the batteries. Given this, it
>> will decide that the best option to protect the hosted servers is to
>> shed load and maximize run time. If the power stays out for too long,
>> then ScanCore will determine hard off is imminent, and decide to
>> gracefully shut down all hosted servers, withdraw and power off. Later,
>> when power returns, the Striker dashboards will monitor the charge rate
>> of the UPSes and as soon as it is safe to do so, restart the nodes and
>> restore full redundancy.
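
A minimal sketch of the power logic in case 2, under stated assumptions: the `power_action` function, the `Action` names and the `shutdown_below` threshold (minutes of estimated battery run time) are invented for illustration and are not taken from ScanCore.

```python
from enum import Enum
from typing import Sequence


class Action(Enum):
    NONE = "none"            # good power is still available somewhere
    SHED_LOAD = "shed_load"  # all UPSes on battery: maximize run time
    SHUTDOWN = "shutdown"    # hard off is imminent: shut down gracefully


def power_action(ups_on_mains: Sequence[bool],
                 minutes_left: float,
                 shutdown_below: float = 10.0) -> Action:
    """Pick a reaction from UPS input state and estimated run time.

    ups_on_mains: one flag per UPS, True if it still has input power.
    minutes_left: estimated remaining battery run time (hypothetical input).
    shutdown_below: hypothetical threshold at which hard off is "imminent".
    """
    if any(ups_on_mains):
        return Action.NONE
    if minutes_left <= shutdown_below:
        return Action.SHUTDOWN
    return Action.SHED_LOAD


print(power_action([True, False], 5.0))
print(power_action([False, False], 30.0))
print(power_action([False, False], 5.0))
```

The restart side (watching the UPS charge rate from the Striker dashboards) is not modelled here.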
>> 3. Similar to case 2, ScanCore can gather temperature data from multiple
>> sources and use this data to distinguish localized cooling failures from
>> environmental cooling failures, like the loss of an HVAC or AC system.
>> In the former case, ScanCore will migrate servers off and, if critical
>> temperatures are reached, shut down systems before hardware damage can
>> occur. In the latter case, ScanCore will decide that minimizing thermal
>> output is the best way to protect hosted servers and, so, will shed load
>> to accomplish this. If necessary to avoid damage, ScanCore will perform
>> a full shut down. Once ScanCore (on the low-powered Striker dashboards)
>> determines thermal levels are safe again, it will restart the nodes and
>> restore full redundancy.
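
One way to picture the localized-versus-environmental distinction in case 3 is the sketch below. It is a simplification under assumptions: the sensor names, the `warn` and `critical` thresholds and the returned action strings are all hypothetical, and the real system presumably weighs far more inputs.

```python
from typing import Dict


def thermal_action(temps: Dict[str, float],
                   warn: float = 35.0,
                   critical: float = 45.0) -> str:
    """Choose a reaction from temperature readings taken at several sources.

    temps maps a (hypothetical) source name to a reading in degrees Celsius.
    """
    hot = {src for src, t in temps.items() if t >= warn}
    if not hot:
        return "ok"
    if max(temps.values()) >= critical:
        # Critical temperature reached: shut down before hardware is damaged.
        return "shutdown"
    if hot == set(temps):
        # Every source is warm: environmental failure, so reduce thermal output.
        return "shed_load"
    # Only some sources are warm: localized failure, move servers away.
    return "migrate_off"


print(thermal_action({"node1": 40.0, "node2": 25.0, "ambient": 24.0}))
print(thermal_action({"node1": 38.0, "node2": 38.0, "ambient": 37.0}))
```

Treating "all sources warm" as the environmental case is the design choice that makes multiple independent sensors necessary; a single reading could not tell the two failures apart.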
>> All of this intelligence is of little use, of course, if it is hard to
>> build and maintain an Anvil! system. Perhaps the greatest lesson learned
>> from our old tutorial was that the barrier to entry had to be reduced.
>> So, this release also dramatically simplifies the process of going
>> from bare iron to provisioned, protected servers. Even with no
>> experience in availability at all, a tech should be able to go from iron
>> in boxes to provisioned servers in one or two days. Almost all steps have
>> been automated, which serves the core goal of maximum reliability by
>> minimizing the chances for human error.
>> This version also introduces the ability to run entirely offline. This
>> version of the Anvil! is entirely self-contained with internal
>> repositories making it possible to fully manage an Anvil! with no
>> external access to the outside world, including rebuilding Striker
>> dashboards or Anvil! nodes after a major fault and building new Anvil!
>> node pairs.
>> There is so much more that the Anvil! platform can do, but this
>> announcement is already quite long, so I'll stop here.
>> I'm more than happy to answer any questions and, of course, I would
>> very much love to hear feedback, suggestions, feature requests or
>> Finally, I want to thank the rest of the team at Alteeve. Without them
>> keeping the lights on and our customers happy, I would never have been
>> able to put the time in needed to make this release possible. And, of
>> course, to all of you for the years of advice, banter and debate. I
>> still have very much to learn!
>> Now, time to start working full time on version 3!
> Users mailing list: Users at clusterlabs.org
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org