[Pacemaker] Enable remote monitoring

Fri Feb 1 09:15:51 EST 2013

----- Original Message -----
> From: "Yan Gao" <ygao at suse.com>
> To: pacemaker at oss.clusterlabs.org
> Sent: Thursday, January 31, 2013 10:37:53 PM
> Subject: Re: [Pacemaker] Enable remote monitoring
> 
> Hi Andrew,
> 
> On 01/31/13 14:35, Andrew Beekhof wrote:
> > 
> > On 24/01/2013, at 3:36 AM, David Vossel <dvossel at redhat.com> wrote:
> > 
> >>
> >>
> >> ----- Original Message -----
> >>> From: "Yan Gao" <ygao at suse.com>
> >>> To: pacemaker at oss.clusterlabs.org
> >>> Sent: Monday, January 21, 2013 11:28:40 PM
> >>> Subject: Re: [Pacemaker] Enable remote monitoring
> >>>
> >>> Hi,
> >>> Here's the code for supporting nagios plugins in lrmd:
> >>>
> >>> https://github.com/gao-yan/pacemaker/commits/nagios
> >>>
> >>> A new resource class "nagios" is introduced.
> >>>
> >>> Actions:
> >>>
> >>> - probe: A resource defined for a resource container is not probed.
> >>> (We can also add a condition in pengine to just avoid probing a
> >>> nagios class resource.)
> >>
> >> Yeah, I think the pengine should know to never probe a nagios
> >> script, regardless of whether it is involved in a container or not.
> >>
> >>> - start: Invokes the nagios plugin with specified parameters (maps the
> >>> instance attributes to the long options of the nagios plugin). If it
> >>> returns non-OK, re-invokes it after some delay (delay = start_timeout / 10),
> >>> until it returns OK or exceeds the start timeout.
> >>
> >> I made a comment about this on the patch.  Shouldn't the
> >> cmd->timeout value be updated each time it is re-scheduled to
> >> account for time already spent?
> >>
> >>>
> >>> - monitor: Recurring invocation of the nagios plugin with specified
> >>> parameters.
> >>>
> >>> - stop: Nothing special is done. The recurring monitor is canceled
> >>> anyway.
> >>>
> >>> - metadata: Reads the corresponding metadata from an XML file in
> >>> NAGIOS_METADATA_DIR.
> >>>
> >>> (As we know, nagios plugins don't support metadata. The current plan is
> >>> to generate the corresponding metadata according to the help of the
> >>> plugins, and put them into NAGIOS_METADATA_DIR for use -- Dejan already
> >>> has made progress on this. Thanks, Dejan!)
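
The attribute-to-option mapping described for the start action can be sketched roughly as follows. This is only an illustration in Python, not the code from the nagios branch: the function name build_plugin_argv, the plugin path, and the example attribute names are all assumptions made for the example.

```python
def build_plugin_argv(plugin_path, instance_attrs):
    """Build an argv list in which each instance attribute becomes
    a long option: {"port": 22} -> ["--port", "22"].

    plugin_path and the attribute names are hypothetical examples.
    """
    argv = [plugin_path]
    # Sort the names so the resulting command line is deterministic.
    for name, value in sorted(instance_attrs.items()):
        argv.append("--" + name)
        argv.append(str(value))
    return argv


# Example with an assumed plugin location and assumed attributes.
argv = build_plugin_argv(
    "/usr/lib/nagios/plugins/check_tcp",
    {"hostname": "192.168.1.10", "port": 22},
)
```

The resulting list could be passed to something like subprocess.run(argv) to invoke the plugin and read its exit code.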
> >>>
> >>>
> >>> For nagios plugins, the exit codes are:
> >>>
> >>> STATE_OK        = 0,
> >>> STATE_WARNING   = 1,
> >>> STATE_CRITICAL  = 2,
> >>> STATE_UNKNOWN   = 3,
> >>> STATE_DEPENDENT = 4,
> >>>
> >>> AFAICS, STATE_OK should map to PCMK_EXECRA_OK, and the others should
> >>> all belong to PCMK_EXECRA_UNKNOWN_ERROR. Well, apparently, there's no
> >>> code to express "NOT_RUNNING" in nagios plugins. I think it should be
> >>> fine, since there's no probe.
> >>>
> >>> Any suggestions are appreciated!
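
The proposed exit-code mapping is small enough to write out. The sketch below is in Python for illustration only; the numeric values given to PCMK_EXECRA_OK and PCMK_EXECRA_UNKNOWN_ERROR are assumed to mirror the OCF codes 0 and 1, and map_nagios_exit is a hypothetical name.

```python
# Nagios plugin exit codes, as listed above.
STATE_OK = 0
STATE_WARNING = 1
STATE_CRITICAL = 2
STATE_UNKNOWN = 3
STATE_DEPENDENT = 4

# Assumed values, taken to mirror the OCF return codes.
PCMK_EXECRA_OK = 0
PCMK_EXECRA_UNKNOWN_ERROR = 1


def map_nagios_exit(code):
    """Map a nagios exit code to the proposed lrmd return code.

    Only STATE_OK counts as success; every other state is reported as
    an unknown error, since nagios has no "not running" code.
    """
    if code == STATE_OK:
        return PCMK_EXECRA_OK
    return PCMK_EXECRA_UNKNOWN_ERROR
```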
> >>
> >> This mostly looks like what I expected.  I'm letting the whole
> >> re-scheduling of the start operation roll around in my head a
> >> bit.  It almost seems like that functionality belongs in the
> >> service library...  retry executing this action until either the
> >> timeout is hit or some target return code is encountered.  Any
> >> thoughts on that?
> > 
> > Who the what now?
> > Why do start ops need to be rescheduled?
> It's very likely that the "start" of the container returns before the
> services inside are started. Abusing start-delay is not preferred. The
> idea is, in the start operation of the nagios resource, to repeatedly
> monitor the service until it returns OK or exceeds the start timeout.

It is likely I'll have to do something similar for my whitebox use case with the lrmd connection resources.
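
For reference, the retry-until-OK start discussed in this thread might look roughly like the sketch below. It is written in Python purely as an illustration and is not the lrmd code; start_with_retries, check, clock, and sleep are hypothetical names. It uses the delay = start_timeout / 10 rule from the proposal and also tracks the time already spent, so the last wait never runs past the start timeout.

```python
import time


def start_with_retries(check, start_timeout,
                       clock=time.monotonic, sleep=time.sleep):
    """Call check() until it returns 0 or start_timeout seconds pass.

    check is any callable returning a plugin-style exit code.
    clock and sleep are injectable so the loop can be tested without
    real waiting. Returns a (succeeded, attempts) tuple.
    """
    deadline = clock() + start_timeout
    delay = start_timeout / 10.0
    attempts = 0
    while True:
        attempts += 1
        if check() == 0:
            return True, attempts
        # Account for time already spent before rescheduling.
        remaining = deadline - clock()
        if remaining <= 0:
            return False, attempts
        sleep(min(delay, remaining))
```

Here check would wrap a single plugin invocation, so the same loop could also serve other callers that need to retry until a target return code is seen.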

-- Vossel

> The latest code for supporting nagios plugins in lrmd is in:
> https://github.com/gao-yan/pacemaker/commits/nagios
> 
> And the code for supporting container in policy engine is still in:
> https://github.com/ClusterLabs/pacemaker/pull/195
> 
> Thanks,
>   Gao,Yan
> --
> Gao,Yan <ygao at suse.com>
> Software Engineer
> China Server Team, SUSE.
> 