[Pacemaker] Enable remote monitoring

Tue Feb 5 13:07:40 EST 2013

----- Original Message -----
> From: "Andrew Beekhof" <andrew at beekhof.net>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Tuesday, February 5, 2013 2:29:11 AM
> Subject: Re: [Pacemaker] Enable remote monitoring
> 
> On Fri, Feb 1, 2013 at 3:37 PM, Gao,Yan <ygao at suse.com> wrote:
> > Hi Andrew,
> >
> > On 01/31/13 14:35, Andrew Beekhof wrote:
> >>
> >> On 24/01/2013, at 3:36 AM, David Vossel <dvossel at redhat.com> wrote:
> >>
> >>>
> >>>
> >>> ----- Original Message -----
> >>>> From: "Yan Gao" <ygao at suse.com>
> >>>> To: pacemaker at oss.clusterlabs.org
> >>>> Sent: Monday, January 21, 2013 11:28:40 PM
> >>>> Subject: Re: [Pacemaker] Enable remote monitoring
> >>>>
> >>>> Hi,
> >>>> Here's the code for supporting nagios plugins in lrmd:
> >>>>
> >>>> https://github.com/gao-yan/pacemaker/commits/nagios
> >>>>
> >>>> A new resource class "nagios" is introduced.
> >>>>
> >>>> Actions:
> >>>>
> >>>> - probe: A resource defined for a resource container is not probed.
> >>>> (We can also add a condition in pengine to just avoid probing a
> >>>> nagios class resource.)
> >>>
> >>> Yeah, I think the pengine should know to never probe a nagios
> >>> script, regardless of whether it is involved in a container.
> >>>
> >>>> - start: Invokes the nagios plugin with specified parameters (Maps
> >>>> the instance attributes to the long options of the nagios plugin).
> >>>> If it returns non-OK, re-invokes it after some delay (delay =
> >>>> start_timeout / 10), until it returns OK or exceeds the start
> >>>> timeout.
> >>>
> >>> I made a comment about this on the patch.  Shouldn't the
> >>> cmd->timeout value be updated each time it is re-scheduled to
> >>> account for time already spent?
> >>>
> >>>>
> >>>> - monitor: Recurring invocation of the nagios plugin with specified
> >>>> parameters.
> >>>>
> >>>> - stop: Nothing special is done. The recurring monitor is canceled
> >>>> anyway.
> >>>>
> >>>> - metadata: Reads the corresponding metadata from an XML file in
> >>>> NAGIOS_METADATA_DIR.
> >>>>
> >>>> (As we know, nagios plugins don't support metadata. The current plan
> >>>> is to generate the corresponding metadata from the help output of
> >>>> the plugins, and put it into NAGIOS_METADATA_DIR for use -- Dejan
> >>>> has already made progress on this. Thanks, Dejan!)
> >>>>
> >>>>
> >>>> For nagios plugins, the exit codes are:
> >>>>
> >>>> STATE_OK        = 0,
> >>>> STATE_WARNING   = 1,
> >>>> STATE_CRITICAL  = 2,
> >>>> STATE_UNKNOWN   = 3,
> >>>> STATE_DEPENDENT = 4,
> >>>>
> >>>> AFAICS, STATE_OK should map to PCMK_EXECRA_OK, and the others should
> >>>> all belong to PCMK_EXECRA_UNKNOWN_ERROR. Well, apparently, there's no
> >>>> code to express "NOT_RUNNING" in nagios plugins. I think it should be
> >>>> fine, since there's no probe.
> >>>>
> >>>> Any suggestions are appreciated!
> >>>
> >>> This mostly looks like what I expected.  I'm letting the whole
> >>> re-scheduling of the start operation roll around in my head a
> >>> bit.  It almost seems like that functionality belongs in the
> >>> service library...  retry executing this action until either the
> >>> timeout is hit or some target return code is encountered.  Any
> >>> thoughts on that?
> >>
> >> Who the what now?
> >> Why do start ops need to be rescheduled?
> > It's very likely that the "start" of the container returns before the
> > services inside are started. Abusing start-delay is not preferred. The
> > idea is, in the start operation of the nagios resource, to repeatedly
> > monitor the service until it returns OK or the start timeout is
> > exceeded.
> 
> I thought both stop and start were a no-op and only monitor did
> anything?
> Did we move on from that (I can see why we might, my memory is just a
> little hazy on the subject)?

Start is the first monitor.  This gives us the distinction between start fail (never worked) and monitor fail (something went wrong afterwards).
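
To make that concrete, here is a rough sketch of "start is the first monitor" in Python. It is not the lrmd code from the nagios branch. The function names, the `check` callback, and the injectable `clock`/`sleep` parameters are hypothetical, and the numeric values used for the two PCMK_EXECRA_* codes are placeholders for the sketch. It retries with delay = start_timeout / 10 as described above, maps every non-OK nagios state to an unknown error, and recomputes the remaining budget before each attempt so that time already spent counts against the start timeout.

```python
import time
from typing import Callable

# Nagios plugin exit codes, as listed earlier in this thread.
STATE_OK = 0
STATE_WARNING = 1
STATE_CRITICAL = 2
STATE_UNKNOWN = 3
STATE_DEPENDENT = 4

# Placeholder values for this sketch only (assumption, not taken from
# the Pacemaker headers).
PCMK_EXECRA_OK = 0
PCMK_EXECRA_UNKNOWN_ERROR = 1


def map_nagios_rc(rc: int) -> int:
    """STATE_OK maps to OK; every other state is an unknown error."""
    return PCMK_EXECRA_OK if rc == STATE_OK else PCMK_EXECRA_UNKNOWN_ERROR


def start_as_first_monitor(
    check: Callable[[float], int],
    start_timeout: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `check` until it reports OK or the start timeout is used up.

    `check` receives the time still available (seconds) and returns a
    nagios exit code. It is a hypothetical callback standing in for one
    invocation of the plugin.
    """
    deadline = clock() + start_timeout
    delay = start_timeout / 10  # retry interval proposed in the thread

    while True:
        # Account for time already spent before each attempt.
        remaining = deadline - clock()
        if remaining <= 0:
            return PCMK_EXECRA_UNKNOWN_ERROR  # start failed: never worked

        if map_nagios_rc(check(remaining)) == PCMK_EXECRA_OK:
            return PCMK_EXECRA_OK  # first successful monitor == started

        # Give up if another delay would leave no budget for a retry.
        if deadline - clock() <= delay:
            return PCMK_EXECRA_UNKNOWN_ERROR
        sleep(delay)
```

A failure returned from here is a start failure. Once it returns OK, any later non-OK result from the recurring monitor is a monitor failure. In practice `check` would run the plugin with `remaining` as its own timeout.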

-- Vossel