[Pacemaker] Enable remote monitoring

Thu Nov 8 00:09:49 EST 2012

On Tue, Nov 6, 2012 at 10:30 PM, Gao,Yan <ygao at suse.com> wrote:
> Hi,
>
> Currently, we can manage VMs via the VM agents, but the services running
> within the VMs are not easy to monitor. If we could run
> nagios/icinga probes from the host against the guest, that would allow us to
> achieve this.
>
> Lars, Dejan and I have been discussing this for some time, and there have
> been quite a few ideas on how to implement it. We are now leaning towards
> a proposal from Lars. Let me introduce the idea here and see
> what you think about it.
>
> First, we could add a resource agent class. The RAs belonging to this
> class wrap nagios/icinga probes. They can be configured as
> special monitor operations for the VMs. They should behave as follows:
>
> 1. The special monitor operations start working after the VMs and the
> services inside are started.
>
> 2. Any failure of the monitor operations is treated as the failure of
> the VM, which triggers the recovery of the VM.
>
> Let me show an example:
>
> primitive db-vm ocf:heartbeat:VirtualDomain \
>         params config="db-vm" hypervisor="xen:///" \
>         ip="192.168.1.122" \
>         op monitor nagios:ftp interval="30s" params user="test"
>
> The "nagios:ftp" specifies which monitor agent is used to monitor the
> VM. It's an optional attribute group expressing the "class/provider/type"
> of the monitor agent, which defaults to "ocf:heartbeat:VirtualDomain"
> for this VM (in which case the monitor is a normal one, as we usually
> configure). We can add more monitors, such as a "nagios:www" type, and so on.

What do you propose the XML should look like?

> We can specify particular "params" for a monitor. The "ip" is
> not actually a parameter that VirtualDomain itself uses; we put it there
> for its monitor operations to inherit, so that we don't have to specify
> it for each monitor separately.

You plan to add 'ip' to the VirtualDomain metadata?
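
To make sure I'm reading the inheritance idea correctly, here is a rough
Python sketch of how a monitor's effective parameters might be computed.
The function name is made up for illustration, and the rule that per-op
"params" override the resource's own values is my assumption; the proposal
only says that monitors inherit "ip".

```python
def effective_monitor_params(resource_params, op_params):
    """Return the parameters a monitor agent would see (hypothetical helper)."""
    merged = dict(resource_params)  # values inherited from the primitive
    merged.update(op_params)        # assumption: per-op "params" take precedence
    return merged


# Values taken from the db-vm example in the proposal.
resource_params = {
    "config": "db-vm",
    "hypervisor": "xen:///",
    "ip": "192.168.1.122",
}
ftp_params = effective_monitor_params(resource_params, {"user": "test"})
```

Note that with a plain merge like this the nagios agent would also be
handed "config" and "hypervisor", which is part of why the metadata
question matters.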

>
>
> Other issues:
> - As we can see, there is a time window after the VM has
> started but before the monitored service is up. A solution is
> adding a "first-failure" flag for the monitor operation, which would
> allow us to ignore the *first* failures of a monitor until it has
> returned healthy once, unless a timeout expires. Ideally, this would be
> handled in LRM.

What happens if there is never a first success?
The cluster will never find out.
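
To make the concern concrete, here is a rough Python sketch of the
"first-failure" behaviour as I understand it. The class name, the
grace_timeout parameter and the returned strings are invented for
illustration; this is not LRM code. The point is the last branch: without
a timeout, a service that never comes up would be ignored forever.

```python
class FirstFailureMonitor:
    """Hypothetical model of a monitor op with the "first-failure" flag set."""

    def __init__(self, grace_timeout):
        self.grace_timeout = grace_timeout  # seconds allowed before a first success
        self.seen_success = False

    def report(self, probe_ok, seconds_since_start):
        """Map one probe result to what the cluster would be told."""
        if probe_ok:
            self.seen_success = True
            return "ok"
        if not self.seen_success and seconds_since_start < self.grace_timeout:
            # Service may still be starting inside the VM: swallow the failure.
            return "ignored"
        # Either the service was healthy before, or the grace period ran out.
        return "failed"
```

For example, with grace_timeout=120 a failed probe at 30s is ignored, but
a failed probe at 130s with no success in between is reported as a failure.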

>
> - A limitation is that we would have to specify different monitor interval
> values for the services within a VM. We could probably fix that in some
> way eventually.
>
>
> Anyway, this is the most straightforward solution we can think of so far
> (please correct me if I'm missing anything). It's open for discussion.
> Any comments and suggestions are welcome and appreciated.

Doesn't look too bad.  Some finer points to discuss but I'm sure we
can reach agreement.

>
> Thanks,
>   Gao,Yan
> --
> Gao,Yan <ygao at suse.com>
> Software Engineer
> China Server Team, SUSE.
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org