[Pacemaker] Enable remote monitoring

Fri Nov 9 16:36:10 EST 2012

On Sat, Nov 10, 2012 at 6:35 AM, David Vossel <dvossel at redhat.com> wrote:
> ----- Original Message -----
>> From: "Lars Marowsky-Bree" <lmb at suse.com>
>> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
>> Sent: Friday, November 9, 2012 11:54:16 AM
>> Subject: Re: [Pacemaker] Enable remote monitoring
>>
>> On 2012-11-09T11:46:59, David Vossel <dvossel at redhat.com> wrote:
>>
>> > What if we made something similar to the concept of an "un-managed"
>> > resource, in that it is only ever monitored, but treated it like a
>> > normal resource?  Meaning start/stop could still execute, but
>> > start is really just the first "monitor" operation and stop just
>> > means the recurring "monitor" is cancelled.
>> >
>> > Having "start" redirect to "monitor" in pacemaker would take care
>> > of that timeout problem you all were talking about with the first
>> > failure.  Set the start operation to some larger timeout.
>> > Basically start would just verify that monitor passed once, then
>> > you could move on to the normal monitor timeouts/intervals.  Stop
>> > would always return success and cancel whatever recurring monitors
>> > are running.
>>
>> That's exactly the kind of abstraction a resource agent class can
>> provide though for the nagios agents - no need to have that special
>> knowledge in the PE. The LRM can hide this, which is partly its
>> purpose.
>
> I know nothing about the nagios agents, but if we are taking that route, why not just have the nagios agents map the "start" action to "monitor" instead of making a new class?  Then PE and LRMD don't need any special knowledge of this.

It needs to be a new class because the scripts (I'm pretty sure)
follow a completely different API to anything else we support.
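
For illustration only, here is a minimal sketch (Python, not actual LRM
code) of what such a class wrapper could do: "start" is just the first
check, "monitor" maps the plugin's exit code, and "stop" always
succeeds. The function names are made up for the example, and treating
a Nagios WARNING as "still up" is an assumption, not a decision anyone
has made here.

```python
import subprocess

# OCF-style return codes used by Pacemaker resource agents.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_ERR_UNIMPLEMENTED = 3
OCF_NOT_RUNNING = 7

# Nagios plugin exit codes (2 is CRITICAL).
NAGIOS_OK = 0
NAGIOS_WARNING = 1
NAGIOS_UNKNOWN = 3


def run_plugin(argv):
    """Run a Nagios-style check command and return its exit code."""
    try:
        result = subprocess.run(
            argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        return result.returncode
    except OSError:
        # The plugin is missing or not executable: report UNKNOWN.
        return NAGIOS_UNKNOWN


def check_passed(argv, run=run_plugin):
    """Assumption: OK and WARNING both count as 'service is up'."""
    return run(argv) in (NAGIOS_OK, NAGIOS_WARNING)


def dispatch(action, argv, run=run_plugin):
    """Hypothetical wrapper translating cluster actions for a plugin."""
    if action == "start":
        # "start" only verifies that the check passes once.
        return OCF_SUCCESS if check_passed(argv, run) else OCF_ERR_GENERIC
    if action == "monitor":
        return OCF_SUCCESS if check_passed(argv, run) else OCF_NOT_RUNNING
    if action == "stop":
        # Nothing to stop; cancelling recurring monitors is the
        # caller's job, so this always reports success.
        return OCF_SUCCESS
    return OCF_ERR_UNIMPLEMENTED
```

The `run` parameter only exists so the mapping can be exercised without
a real plugin installed.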

>
>> > Now that I think about it, I'm not even sure we need the new
>> > container Andrew and I talked about at all if we introduce
>> > "monitor-only" resources.
>>
>> Yes. We'd still need it.
>>
>> > At this point we could just have a group where the first member
>> > launches the vm, and all the members after that are the
>> > monitor-only resources that start/stop similar to normal resources
>> > for the PE.  If any of the group members fail, I guess we'd need
>> > the whole group to be recovered in the right order.
>>
>> That's the point - the "right order" for a container is not quite
>> the same as the right order for a regular group. Basically, the group
>> semantics would recover from the failed resource onward, never the
>> VM resource (container).
>
> Seems like it would be possible to create a group option to recover all group members from the first resource onward on a failure.  As long as the vm remains first, would the right order not be preserved?

Please. Not a group. Groups are groups and these are different. Please
don't make groups any worse than they already are ;-)
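
To make the difference in recovery order concrete, here is a toy sketch
(Python, a deliberately simplified model and not how the PE computes
anything). The member names are borrowed from the example further down
in this mail; both function names are invented for illustration.

```python
def group_recovery(members, failed):
    """Regular group: recover the failed member and everything after it."""
    index = members.index(failed)
    return members[index:]


def container_recovery(members, failed):
    """Container: any member failing maps to the container itself,
    so the whole set is recovered starting from the first member."""
    if failed not in members:
        raise ValueError(f"unknown resource: {failed}")
    return list(members)


members = [
    "vm",
    "rsc-monitor-only-whatever",
    "rsc-monitor-only-somethingelse",
]

# A failure of the last monitor-only resource never touches the VM
# under group semantics, but restarts it under container semantics.
group_plan = group_recovery(members, "rsc-monitor-only-somethingelse")
container_plan = container_recovery(members, "rsc-monitor-only-somethingelse")
```

In this model `group_plan` holds only the failed resource, while
`container_plan` holds all three members with "vm" first.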

>
>> If you look at my proposal, I actually made the "container=" a group
>> attribute - because we need to map monitor failures to the container,
>> as well as ignore any stop failures (service is down clean as long as
>> the container is eventually stopped).
>
> I see what you are saying. This is basically the same concept I saw earlier where the monitor resources were defined in the operation tags of a resource. This abstraction moves the resource to the container and makes the monitor operations resource primitives that are only monitored.
>
> I don't think we should have to worry about stop failures at all though.  Stop failures shouldn't occur except possibly at the vm resource.  With the "monitor-only" resources I outlined, or with the new resource class you proposed, stop should just always succeed for these monitor resources.  No ignoring of stop failures should have to occur.
>
>>
>> I think the shell might render this differently, even if we express
>> it as a group + meta-attribute(s) in the XML (which seems to be the
>> way to go). "container ..." is easier on the eyes ;-)
>
> It doesn't matter to me how this is represented for the user through whatever cli/shell tool someone uses.
>
> Assuming we figure out a way to make these nagios resources map "start" to "monitor", and stop always succeeds (however we agree on doing that: new class, resource option, whatever), would the abstraction below work if the group could be made to recover from the first resource onward for any failure in the chain?
>
> <group id="vm_and_resources">
> <primitive id="vm" ... />
> <primitive id="rsc-monitor-only-whatever" ..../>
> <primitive id="rsc-monitor-only-somethingelse" .../>
> </group>
>
> If the above works, we get clones of these groups for free, and implementation should be fairly straightforward.

Trust me, don't go there.
Groups are already a sufficiently tortured construct.

> -- Vossel
>
>>
>>
>> Regards,
>>     Lars
>>
>> --
>> Architect Storage/HA
>> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
>> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started:
>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>