[ClusterLabs] Coming in 1.1.15: Event-driven alerts

Fri Apr 22 18:55:04 EDT 2016

Ken Gaillot <kgaillot at redhat.com> wrote:
> On 04/21/2016 06:09 PM, Adam Spiers wrote:
> > Ken Gaillot <kgaillot at redhat.com> wrote:
> >> Hello everybody,
> >>
> >> The release cycle for 1.1.15 will be started soon (hopefully tomorrow)!
> >>
> >> The most prominent feature will be Klaus Wenninger's new implementation
> >> of event-driven alerts -- the ability to call scripts whenever
> >> interesting events occur (nodes joining/leaving, resources
> >> starting/stopping, etc.).
> > 
> > Ooh, that sounds cool!  Can it call scripts after fencing has
> > completed?  And how is it determined which node the script runs on,
> > and can that be limited via constraints or similar?
> 
> Yes, it is called after all "interesting" events (including fencing), and
> the script can use the provided environment variables to determine what
> type of event it was.

Great.  Does the script run on the DC, or is that configurable somehow?
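
Just to check my understanding of the environment-variable part, here is a
rough sketch of how I imagine an alert script telling events apart.  The
variable names (CRM_alert_kind, CRM_alert_node, CRM_alert_rsc,
CRM_alert_task) are my assumption about the interface, not something
confirmed in this thread, and describe_alert() is just a hypothetical
helper for illustration:

```python
import os


def describe_alert(env):
    """Return a one-line summary of the event described by env.

    The CRM_alert_* names below are assumed, not confirmed in this thread.
    """
    kind = env.get("CRM_alert_kind", "unknown")
    node = env.get("CRM_alert_node", "unknown")
    if kind == "fencing":
        return "fencing: node=" + node
    if kind == "node":
        return "node: node=" + node
    if kind == "resource":
        rsc = env.get("CRM_alert_rsc", "unknown")
        task = env.get("CRM_alert_task", "unknown")
        return "resource: rsc=%s task=%s node=%s" % (rsc, task, node)
    # Anything else is an event type this sketch does not handle.
    return "ignored: kind=" + kind


if __name__ == "__main__":
    # When run as an alert script, the event details would come from
    # the process environment.
    print(describe_alert(os.environ))
```

Taking the environment as a plain dict argument keeps the dispatch logic
testable without a running cluster.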

> We don't notify before events, because at that moment we don't know
> whether the event will really happen or not. We might try but fail.

You lost me here ;-)

> > I'm wondering if it could replace the current fencing_topology hack we
> > use to invoke fence_compute which starts the workflow for recovering
> > VMs off dead OpenStack nova-compute nodes.
> 
> Yes, that is one of the reasons we did this!

Haha, at this point can I say great minds think alike? ;-)

> The initial implementation only allowed for one script to be called (the
> "notification-agent" property), but we quickly found out that someone
> might need to email an administrator, notify nova-compute, and do other
> types of handling as well. Making someone write one script that did
> everything would be too complicated and error-prone (and unsupportable).
> So we abandoned "notification-agent" and went with this new approach.
> 
> Coordinate with Andrew Beekhof for the nova-compute alert script, as he
> already has some ideas for that.

OK.  I'm sure we'll be able to talk about this more next week in Austin!

> > Although even if that's possible, maybe there are good reasons to stay
> > with the fencing_topology approach?
> > 
> > Within the same OpenStack compute node HA scenario, it strikes me that
> > this could be used to invoke "nova service-disable" when the
> > nova-compute service crashes on a compute node and then fails to
> > restart.  This would eliminate the window in between the crash and the
> > nova server timing out the nova-compute service - during which it
> > would otherwise be possible for nova-scheduler to attempt to schedule
> > new VMs on the compute node with the crashed nova-compute service.
> > 
> > IIUC, this is one area where masakari is currently more sophisticated
> > than the approach based on OCF RAs:
> > 
> > https://github.com/ntt-sic/masakari/blob/master/docs/evacuation_patterns.md#evacuation-patterns
> > 
> > Does that make sense?
> 
> Maybe. The script would need to be able to determine based on the
> provided environment variables whether it's in that situation or not.

Yep.
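
For the nova-compute case, the check might look something like the sketch
below.  Again, the CRM_alert_* variable names are my assumption, as is the
resource name "nova-compute" and the idea that a failed start shows up as
task "start" with a non-zero return code.  The function name is
hypothetical too:

```python
def is_failed_nova_compute_start(env, rsc_name="nova-compute"):
    """Return True if env looks like a failed start of rsc_name.

    Assumes (unconfirmed) CRM_alert_kind/rsc/task/rc variables.
    """
    if env.get("CRM_alert_kind") != "resource":
        return False
    if env.get("CRM_alert_rsc") != rsc_name:
        return False
    if env.get("CRM_alert_task") != "start":
        return False
    try:
        rc = int(env.get("CRM_alert_rc", "0"))
    except ValueError:
        # An unparsable return code tells us nothing, so do not act on it.
        return False
    return rc != 0
```

Only when this returns True would the script go on to call
"nova service-disable" for the affected node; everything else would be
ignored.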