[ClusterLabs] Ansible role to configure Pacemaker

Adam Spiers aspiers at suse.com
Thu Jun 7 16:57:18 UTC 2018


Jan Pokorný <jpokorny at redhat.com> wrote:
>While I see why Ansible is compelling, I feel it's important to
>challenge this trend of trying to bend/rebrand a _machine-local
>configuration management tool_ as a _distributed system management tool_
>(Pacemaker is a distributed application/framework of sorts), which Ansible
>alone is _not_, as far as I know, hence the effort doesn't seem to be
>100% sound (which really matters if reliability is the goal).

I'm not sure I understand.  Are you saying Ansible is a machine-local
configuration management tool, not a distributed system management
tool?  Because I don't think that statement is accurate; Ansible was
absolutely designed from the beginning for orchestrating config
management over multiple machines (unlike Chef or Puppet).  But as an
RH employee you must know that already, so I'm probably missing
something ;-)

>Once more, this has nothing to do with the announced project, it's
>just the trending fuss on this topic that indicates to me that people
>independently, as they keenly invent their own wheel (here: Ansible
>roles), get blind to the fallacy that everything must work nicely with
>multi-machine shared-state scenarios like they are used to with
>single-host bootstrapping, without any shortcomings.

Ansible is not intended purely for single-host bootstrapping.
But again I'm sure you already know that, so I'm a bit confused about
what your point is here.

>But there are, and precisely because the optimal tool for the task
>does not get selected!  Just imagine what would happen if a single
>machine got configured independently by multiple Ansible actors
>(there may be mechanisms -- relatively easy within the same host --
>that would prevent such interference, but assume for now they are not
>strong enough).

ICBW but it sounds like you are imagining a problem which isn't always
there, and even when it is there, it's not big enough to justify
chucking away the other benefits of automating deployment of Pacemaker
via something like Ansible.  In other words, don't throw the baby out
with the bathwater[0].

[0] https://en.wikipedia.org/wiki/Don%27t_throw_the_baby_out_with_the_bathwater

For example, I work on a product which uses Ansible running from a
central node to deploy clusters.  By virtue of the documented contract
with the customer about what deployment/maintenance procedures are
supported, we can assume that only one Ansible actor will ever run
concurrently.  If we are worried that the customer will ignore the
documentation and take actions we don't support, we can implement some
kind of simple locking on the deployer node (see the sketch below) and
that's plenty good enough.  And yes, this makes the deployer node a
SPoF, but again there are perfectly acceptable and simple ways to
mitigate that issue (briefly: make it easy to turn any node into the
deployer).
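
To illustrate what I mean by simple locking, here's a minimal sketch
(the lock path and play names are made up, not taken from any existing
role) which would wrap the real deployment plays:

- name: Take a deployment lock on the deployer node
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Fail early if another deployment run already holds the lock
      # mkdir (without -p) is atomic and exits non-zero if the
      # directory already exists, so a second concurrent run aborts
      # here instead of interleaving its changes with ours
      command: mkdir /var/lock/pacemaker-deploy

# ... the plays which actually deploy and configure Pacemaker go here ...

- name: Release the deployment lock
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Remove the lock directory
      file:
        path: /var/lock/pacemaker-deploy
        state: absent

A second operator (or a stray cron job) launching the playbook while a
run is still in flight then fails fast at the first task rather than
racing the first run.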

So whilst the concerns you write about here are potentially
correct from a theoretical perspective, in the real world they are
most likely not strong enough to prevent us from being interested in
using (say) Ansible to deploy Pacemaker.

>What will happen?  Likely some mess-ups will occur, as
>glorified idempotence is hard to achieve atomically.  Voilà, the
>inflicted race conditions get exercised one by one, until there's
>enough bad luck that the rule of idempotence gets broken, just because
>of these processes emulating a schizophrenic (and at the same time
>multitasking) admin.  Ouch!
>
>Now, reflect this onto the situation with possibly concurrent
>cluster configuration.  One cannot really expect the cluster
>stack to be bullet-proof against these sorts of mishandling.
>A single cluster administrator operating at a time?  Ideal!
>A few administrators, presumably with separate areas of
>configuration interest?  Pacemaker is quite ready.
>Cluster configuration randomly touched from a random node
>at a random time (the equivalent of said schizophrenic multitasking
>administrator with a single host)?  Chances are off over a
>sufficiently long period when this happens.
>
>The solution here is to break that randomness; configuration
>is modified either:
>1. from a single node at a time in the cluster (plus preferably
>   batching all required changes into a single request)
>2. by mutual time-critical exclusion of triggering the changes
>   across the nodes
>3. by mutual locality-critical exclusion in the subject of the
>   changes initiated from particular nodes

It's hard to know exactly what you mean by case 3 here.

>Putting 1. and 3. aside as not very interesting (1. means
>a degenerate case with a single point of failure

I don't think it has to mean that.  It's possible to ensure that
config is only changed from one node at a time via a tool such as
Ansible, without hardcoding that to the same node every time (see the
sketch below).
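
As a rough illustration (the 'cluster' group name, the temporary file
path and the resource being created are invented for the example, and
idempotence is glossed over), a play can funnel all CIB changes through
one dynamically chosen node per run and batch them into a single push:

- name: Apply cluster configuration from exactly one node per run
  hosts: cluster
  tasks:
    - name: Batch the CIB changes offline and push them in one request
      # run_once plus delegation means only one node modifies the CIB
      # during this run, but which node that is need not be the same
      # every time
      run_once: true
      delegate_to: "{{ groups['cluster'] | first }}"
      shell: |
        set -e
        pcs cluster cib /tmp/deploy-cib.xml
        pcs -f /tmp/deploy-cib.xml resource create example-ip \
            ocf:heartbeat:IPaddr2 ip=192.0.2.10
        pcs cluster cib-push /tmp/deploy-cib.xml

That also covers the "batching all required changes into a single
request" part of your option 1.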

>and 3. kills
>the universality), what we get is really a dependency on some
>kind of distributed lock and/or transactional system.
>Well, we have just discovered that what we need to automate our
>predestined configuration in the cluster reliably and without
>hurting universality (like "breaking the node symmetry")

What do you mean by node symmetry and why is it important?

>is said distributed system management ("orchestration") tool.
>Does Ansible have these capabilities?

I'm struggling to understand exactly what you mean, but yes I think it
probably does.

>Now, one idea there might be to make tools like pcs compensate
>for these shortcomings of machine-local configuration management ones.
>Sounds good, right?  Absolutely not, more like a bad joke!
>Because what else can it be, the development of orchestration-like
>features (with all the complexities solved once in corosync/DLM
>already; relaxing non-dependency on the very subject of management
>may not be wise) on top of a regular high-level cluster management
>tool only[*] to bridge the gap in something that is simply a subpar
>fit for distributed environments to begin with?
>
>As a Czech proverb puts it: think twice, act once.

Here you seem to be assuming that an Ansible Pacemaker role would have
to be used in an automated, fully orchestrated scenario where cluster
config is being managed by multiple nodes in a way which requires some
complex consensus model.  Can you give an example of why anyone would
need to do that?

>[*] non-automated/human-triggered usage is generally fine as it's
>    highly unlikely that none of 1.-3. would be satisfied, so there
>    would be next to no gain for these workflows

OK, so maybe we are agreed after all.  But if you acknowledge that
manually triggered usage is generally safe, then perhaps you shouldn't
also assume that "people [...] get blind to the fallacy that everything
must work nicely with multi-machine shared-state scenarios" ;-)

