[Pacemaker] Enable remote monitoring

Thu Nov 8 18:15:02 EST 2012

On Fri, Nov 9, 2012 at 10:03 AM, Andrew Beekhof <andrew at beekhof.net> wrote:
> On Thu, Nov 8, 2012 at 5:24 PM, Gao,Yan <ygao at suse.com> wrote:
>> Hi Andrew,
>>
>> On 11/08/12 13:09, Andrew Beekhof wrote:
>>> On Tue, Nov 6, 2012 at 10:30 PM, Gao,Yan <ygao at suse.com> wrote:
>>>> Hi,
>>>>
>>>> Currently, we can manage VMs via the VM agents. But the services running
>>>> within VMs are not easy to monitor. If we could run nagios/icinga
>>>> probes from the host against the guest, that would allow us to
>>>> achieve this.
>>>>
>>>> Lars, Dejan and I have been discussing this for some time. There have
>>>> been quite a few thoughts on how to implement it. We are now inclined
>>>> towards a proposal from Lars. Let me introduce the idea here and see
>>>> what you think about it.
>>>>
>>>> First, we could add a resource agent class. The RAs belonging to this
>>>> class wrap around nagios/icinga probes. They can be configured as
>>>> special monitor operations for the VMs. The behavior should be:
>>>>
>>>> 1. The special monitor operations start working after the VMs and the
>>>> services inside are started.
>>>>
>>>> 2. Any failure of the monitor operations is treated as the failure of
>>>> the VM, which triggers the recovery of the VM.
>>>>
>>>> Let me show an example:
>>>>
>>>> primitive db-vm ocf:heartbeat:VirtualDomain \
>>>>         params config="db-vm" hypervisor="xen:///" \
>>>>         ip="192.168.1.122" \
>>>>         op monitor nagios:ftp interval="30s" params user="test"
>>>>
>>>> The "nagios:ftp" specifies which monitor agent is used to monitor the
>>>> VM.  It's an optional group of attributes expressing the
>>>> "class/provider/type" of the monitor agent, which defaults to
>>>> "ocf:heartbeat:VirtualDomain" for this VM (in which case the monitor
>>>> would be a normal one, as we usually configure). We can add more
>>>> monitors, such as a "nagios:www" type, and so on.
>>>
>>> What do you propose the XML should look like?
>> It should look like:
>> ...
>> <op id="vm-monitor-30" name="monitor" class="nagios" type="ftp"
>> interval="30s" ignore-first-failures="true">
>>   <instance_attributes id="vm-monitor-30-params">
>>     <nvpair id="vm-monitor-30-params-user" name="user" value="test"/>
>>   </instance_attributes>
>> </op>
>> ...
>>
>>>
>>>> We can specify particular "params" for a monitor. The "ip" is
>>>> actually not a useful parameter for VirtualDomain itself; we put it
>>>> there for its monitor operations to inherit, so that we don't have to
>>>> specify it for each monitor separately.
>>>
>>> You plan to add 'ip' to the VirtualDomain metadata?
>> It should be in the metadata of nagios:ftp and also other monitor
>> agents. We'd like parameter inheritance to avoid configuration repetition.
>
> That sounds overly complex (you now need to do two metadata lookups to
> determine the parameter lists).

Actually more - assuming a VM can contain multiple services, with each
one being checked by a nagios script.

> I think I'd prefer to avoid that if
> possible.
>
>>
>>>
>>>>
>>>>
>>>> Other issues:
>>>> - As we can see, there's a time window after the VM is
>>>> started but before the monitored service has started. A solution is
>>>> adding a "first-failure" flag for the monitor operation, which would
>>>> allow us to ignore the *first* failures of a monitor until it has
>>>> returned healthy once, unless the timeout expires. Ideally, it could be
>>>> handled in LRM.
>>>
>>> What happens if there is never a first success?
>>> The cluster will never find out.
>> It'll reach the timeout and return.
>
> Which timeout? Not the one in <op...> since the whole operation might
> repeat many times over before succeeding.
>
>> We should give a reasonable monitor
>> timeout I think.
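
One way to read the proposal that avoids the ambiguity is to give the
flag its own deadline, separate from the per-operation timeout. The
Python sketch below is only an illustration of that reading; the class
name and the grace_seconds parameter are assumptions made for the
example and are not part of the proposal or of the LRM. Failures are
ignored until the first success, but only until the grace period
measured from VM start runs out, so the cluster still finds out if the
service never comes up.

```python
class FirstFailureFilter:
    """Decide whether a recurring monitor result counts as a failure.

    Hypothetical helper: grace_seconds is a deadline measured from the
    VM start, independent of the timeout of each monitor run.
    """

    def __init__(self, grace_seconds, started_at):
        self.deadline = started_at + grace_seconds
        self.seen_success = False

    def report(self, ok, now):
        """Return True if this result should be reported as a failure."""
        if ok:
            self.seen_success = True
            return False
        if self.seen_success:
            # The service was healthy once; any later failure is real.
            return True
        # Never healthy so far: ignore failures only within the grace period.
        return now >= self.deadline


# Service never comes up: failures are ignored until the deadline passes.
never_up = FirstFailureFilter(grace_seconds=120, started_at=0)
early = never_up.report(ok=False, now=30)
late = never_up.report(ok=False, now=120)

# Service comes up after 60s and then fails: that failure counts at once.
came_up = FirstFailureFilter(grace_seconds=120, started_at=0)
before = came_up.report(ok=False, now=30)
healthy = came_up.report(ok=True, now=60)
after = came_up.report(ok=False, now=90)
```
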
>>
>>>
>>>>
>>>> - A limitation is that we would have to specify different monitor
>>>> interval values for the services within a VM. We could probably fix
>>>> that in some way eventually.
>>>>
>>>>
>>>> Anyway, this is the most straightforward solution we can think of so far
>>>> (Please correct me if I'm missing anything). It's open for discussion.
>>>> Any comments and suggestions are welcome and appreciated.
>>>
>>> Doesn't look too bad.  Some finer points to discuss but I'm sure we
>>> can reach agreement.
>> Nice, thanks!
>>
>> Regards,
>>   Gao,Yan
>> --
>> Gao,Yan <ygao at suse.com>
>> Software Engineer
>> China Server Team, SUSE.
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org