[Pacemaker] Enable remote monitoring

Tue Nov 6 06:30:20 EST 2012

Hi,

Currently, we can manage VMs via the VM agents. But the services running
within the VMs are not easy to monitor. If we could run
nagios/icinga probes from the host against the guest, that would allow us to
monitor them.

Lars, Dejan and I have been discussing this for some time. There have
been quite a few ideas on how to implement it. We are now leaning towards
a proposal from Lars. Let me introduce the idea here and see
what you think about it.

First, we could add a resource agent class. The RAs belonging to this
class wrap nagios/icinga probes. They can be configured as
special monitor operations for the VMs. They should behave as follows:

1. The special monitor operations start working after the VMs and the
services inside are started.

2. Any failure of the monitor operations is treated as the failure of
the VM, which triggers the recovery of the VM.
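
To make behavior 2 concrete, here is a rough Python sketch of how a wrapper
RA might translate a probe's exit status into a monitor result. The nagios
plugin statuses (0..3) and the OCF return codes are the standard ones; the
mapping policy itself, and the name probe_to_ocf, are only assumptions for
illustration, not part of the proposal.

```python
# Standard nagios/icinga plugin exit statuses.
NAGIOS_OK, NAGIOS_WARNING, NAGIOS_CRITICAL, NAGIOS_UNKNOWN = 0, 1, 2, 3

# Standard OCF return codes relevant to a monitor operation.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1


def probe_to_ocf(nagios_status: int) -> int:
    """Translate a probe exit status into a monitor result.

    Assumed policy (hypothetical): OK and WARNING count as healthy,
    everything else is a failure that would trigger recovery of the VM.
    """
    if nagios_status in (NAGIOS_OK, NAGIOS_WARNING):
        return OCF_SUCCESS
    return OCF_ERR_GENERIC
```

Whether a WARNING should count as healthy is a policy question that would
still need to be decided.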

Let me show an example:

primitive db-vm ocf:heartbeat:VirtualDomain \
	params config="db-vm" hypervisor="xen:///" \
	ip="192.168.1.122" \
	op monitor nagios:ftp interval="30s" params user="test"

The "nagios:ftp" specifies which monitor agent is used to monitor the
VM. It's an optional group of attributes expressing the
"class/provider/type" of the monitor agent, which defaults to
"ocf:heartbeat:VirtualDomain" for this VM (in which case the monitor is a
normal one, as we usually configure it). We can add more monitors, such
as a "nagios:www" type, and so on.
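
The defaulting rule could look roughly like the following Python sketch.
The function name resolve_monitor_agent and the tuple representation are
hypothetical; only the "nagios:ftp" and "ocf:heartbeat:VirtualDomain"
strings come from the example above.

```python
def resolve_monitor_agent(spec, resource_agent):
    """Return (class, provider, type) for a monitor operation.

    spec           -- the optional agent string on the op, e.g. "nagios:ftp"
    resource_agent -- (class, provider, type) of the resource itself
    """
    if not spec:
        # No agent given: an ordinary monitor run by the resource's own RA.
        return resource_agent
    parts = spec.split(":")
    if len(parts) == 3:
        return (parts[0], parts[1], parts[2])
    if len(parts) == 2:
        # "class:type" form without a provider, as in "nagios:ftp".
        return (parts[0], None, parts[1])
    raise ValueError("unrecognized monitor agent: %r" % spec)


vm_agent = ("ocf", "heartbeat", "VirtualDomain")
ftp_monitor = resolve_monitor_agent("nagios:ftp", vm_agent)
plain_monitor = resolve_monitor_agent(None, vm_agent)
```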

We can specify particular "params" for a monitor. The "ip" is not
actually a parameter that VirtualDomain uses; we put it there for the
monitor operations to inherit, so that we don't have to specify it for
each monitor separately.
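
The inheritance could be sketched in Python like this. It assumes that a
monitor inherits all of the resource's params and that the monitor's own
"params" win on conflict; neither detail is settled by the proposal, and
effective_monitor_params is a made-up name.

```python
def effective_monitor_params(resource_params, monitor_params):
    """Merge resource-level params into a monitor's own params.

    Assumption: monitor-level values override inherited ones.
    """
    merged = dict(resource_params)   # inherited from the VM resource
    merged.update(monitor_params)    # monitor-specific values take precedence
    return merged


# Values taken from the db-vm example above.
vm_params = {"config": "db-vm", "hypervisor": "xen:///", "ip": "192.168.1.122"}
ftp_params = effective_monitor_params(vm_params, {"user": "test"})
```

Here the ftp probe would see both "ip" (inherited) and "user" (its own).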

Other issues:
- As we can see, there is a time window after the VM has started but
before the monitored service is up. A solution is to add a
"first-failure" flag to the monitor operation, which would allow us to
ignore the *first* failures of a monitor until it has returned healthy
once, unless a timeout expires first. Ideally, this could be handled in
the LRM.

- A limitation is that we would have to specify different monitor
interval values for the services within a VM. We could probably fix that
in some way eventually.
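
The "first-failure" idea from the first item could be sketched in Python
as below. This is only an illustration of the intended semantics, not how
the LRM would implement it; the class name FirstFailureFilter and the
grace_seconds parameter are hypothetical.

```python
import time


class FirstFailureFilter:
    """Ignore monitor failures until the first healthy result or a timeout."""

    def __init__(self, grace_seconds, clock=time.monotonic):
        self._clock = clock
        # After this point in time, failures are no longer ignored.
        self._deadline = clock() + grace_seconds
        self._seen_healthy = False

    def is_real_failure(self, probe_ok):
        """Return True if this result should be reported as a failure."""
        if probe_ok:
            self._seen_healthy = True
            return False
        if self._seen_healthy:
            # The service was up before, so this is a genuine failure.
            return True
        # Never healthy yet: only a failure once the grace period is over.
        return self._clock() >= self._deadline
```

A new filter would be created each time the VM is started, so the grace
period begins again after every recovery.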

Anyway, this is the most straightforward solution we can think of so far
(please correct me if I'm missing anything). It's open for discussion.
Any comments and suggestions are welcome and appreciated.

Thanks,
  Gao,Yan
-- 
Gao,Yan <ygao at suse.com>
Software Engineer
China Server Team, SUSE.