[Pacemaker] chicken-egg-problem with libvirtd and a VM within cluster

Fri Oct 12 03:22:13 EDT 2012

On Fri, Oct 12, 2012 at 3:18 AM, Andrew Beekhof <andrew at beekhof.net> wrote:
> This has been a topic that has popped up occasionally over the years.
> Unfortunately we still don't have a good answer for you.
>
> The "least worst" practice has been to have the RA return OCF_STOPPED
> for non-recurring monitor operations (aka. startup probes) IFF its
> pre-requisites (ie. binaries, or things that might be on a cluster
> file system) are not available.
>
> Possibly we need to begin using the ordering constraints (normally
> used for ordering start operations) for the startup probes too.
> Ie. order(A, B) ==> A.start before B.(monitor_0, start)
>
> I had been resisting that move, but perhaps it's time.
>
> (It would also help avoid slamming the cluster with a bazillion
> operations in parallel when several nodes start up together)
>
> Lars? Florian? Comments?

Sure. As Tom correctly observes, the problem (as I know it) occurs
when manually stopping Pacemaker services and then restarting them. As
it shuts down, Pacemaker kills libvirtd (after migrating off or
stopping all VMs), and then as you bring it back up, the probe runs
into an error. The same, btw, applies if you only send the node into
standby mode.

For manual intervention, the workaround is simply this:

- Stop Pacemaker services, or put node in standby (libvirtd stops in
the process as the local clone instance shuts down).
- Do whatever you need to do on that box.
- Start libvirtd.
- Start Pacemaker services, or take node online.
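
As a rough sketch, the standby variant of those steps could be wrapped
like this. The service name "libvirtd" and the use of service(8) are
assumptions (they vary by distribution), and the run/DRY_RUN helper
and the two function names are made up for illustration only:

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Assumption: the init script is called "libvirtd" on this distribution.
LIBVIRTD_SERVICE="${LIBVIRTD_SERVICE:-libvirtd}"

# Step 1: put the local node in standby; libvirtd stops along with
# the local clone instance.
maintenance_begin() {
    run crm node standby
}

# Steps 3 and 4: start libvirtd by hand first, so that the probes
# find it running, then bring the node back online.
maintenance_end() {
    run service "$LIBVIRTD_SERVICE" start
    run crm node online
}
```

Call maintenance_begin, do the work on the box, then call
maintenance_end. Setting DRY_RUN=1 only prints what would be run.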

For most people, this issue doesn't occur on system boot, as libvirtd
would normally start before corosync, or corosync/pacemaker isn't part
of the system bootup sequence at all (the latter is preferred for
two-node clusters to prevent fencing shootouts in case of cluster
split brain).

On that ha-kvm.pdf guide, I will add that I'm guessing this is not the
only piece of information missing or outdated in it. However, I have
no rights to that document other than to be named as an original
author and to use it under CC-NC-ND terms like anyone else, and I have
no access to the sources anymore, so there's no way for me to update
it. Maybe the Linbit folks are willing/able to do that.

Back on the probe issue, we're in a bit of a catch-22 as libvirtd can
be freely restarted and stopped while leaving domains (VMs) running.
So the assumption "if libvirtd doesn't run, then the domain can't be
running" simply doesn't hold up. In fact, it's outright dangerous, as
a domain may well run _and have read/write access to shared resources_
while libvirtd isn't running. So doing the naive thing, bailing out of
monitor when we can't detect a libvirtd pid, doesn't fly.
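
To make the objection concrete, the naive shortcut would look roughly
like the sketch below. This is not code from any shipped RA: is_probe
is a local stand-in for the ocf_is_probe helper in ocf-shellfuncs, and
detecting libvirtd via pidof is an assumption. The danger is the
early return: it reports "not running" even though a domain may still
be up.

```shell
#!/bin/sh
OCF_NOT_RUNNING=7   # standard OCF exit code for "cleanly stopped"

# Stand-in for ocf_is_probe: a probe is a monitor with interval 0.
is_probe() {
    [ "${__OCF_ACTION:-}" = "monitor" ] &&
        [ "${OCF_RESKEY_CRM_meta_interval:-}" = "0" ]
}

# Assumption: a libvirtd process found by pidof means the daemon is up.
libvirtd_running() {
    pidof libvirtd >/dev/null 2>&1
}

# Returns OCF_NOT_RUNNING when the shortcut applies, 0 when the real
# monitor logic would have to carry on.
naive_probe_shortcut() {
    if is_probe && ! libvirtd_running; then
        # Unsafe: the domain can be running without libvirtd.
        return $OCF_NOT_RUNNING
    fi
    return 0
}
```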

What would fly is to check for libvirtd on _every_ invocation of the
RA (well, maybe all except validate and usage), and to restart it on
the sole condition that we can't detect its pid. That, however, breaks
the contract that a probe should be non-invasive and really shouldn't
be touching any system services. Also, a running libvirtd is not
needed, to the best of my knowledge, when the hypervisor in use is Xen
rather than KVM. We could mitigate that by making it configurable, but
the only sane default would be to have this enabled, which again
breaks said contract.
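
A minimal sketch of that idea, to show where the contract gets
broken. The parameter name ensure_libvirtd is hypothetical, as are the
function names, the pidof check and the "service libvirtd start"
call; skipping meta-data in addition to validate and usage is my own
assumption:

```shell
#!/bin/sh
# Hypothetical RA parameter; defaults to enabled, as discussed above.
: "${OCF_RESKEY_ensure_libvirtd:=true}"

# Succeeds if the given action should make sure libvirtd is up.
needs_libvirtd() {
    case "$1" in
        # validate and usage are skipped; meta-data is an added guess.
        validate-all|usage|meta-data) return 1 ;;
    esac
    [ "$OCF_RESKEY_ensure_libvirtd" = "true" ]
}

# Assumption: a libvirtd process found by pidof means the daemon is up.
libvirtd_running() {
    pidof libvirtd >/dev/null 2>&1
}

# Called at the top of every action, probes included. Starting a
# system service from a probe is exactly the invasive part.
ensure_libvirtd() {
    needs_libvirtd "$1" || return 0
    libvirtd_running && return 0
    service libvirtd start
}
```

For Xen setups one would set ensure_libvirtd=false, in which case
ensure_libvirtd returns immediately for every action.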

When virsh is invoked with a qemu:///session URI, it will actually
start up a user-specific libvirtd by itself, but as far as I know
there is no way to do that for qemu:///system, which most people will
be using.

Andrew, your suggestion would fix that issue, but it would obviously
make the config more convoluted. In effect, we'd need one order and
one colo constraint more than we already do. For a silly idea, how
about thinking about being able to define a list of op types in a
constraint, rather than a single op? As in:

order libvirtd_before_virtdom inf: libvirtd:start virtdom_foo:monitor,start
colocation virtdom_on_libvirtd inf: virtdom_foo:Started,Probed libvirtd:Started

(Of course no such thing as a "Probed" role currently exists, so here
we go down the rabbit hole...)

I hope this is useful. Thoughts are much appreciated.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now