[ClusterLabs] Coming in 1.1.15: Event-driven alerts

Fri Apr 22 17:28:21 UTC 2016

On 04/21/2016 06:09 PM, Adam Spiers wrote:
> Ken Gaillot <kgaillot at redhat.com> wrote:
>> Hello everybody,
>>
>> The release cycle for 1.1.15 will be started soon (hopefully tomorrow)!
>>
>> The most prominent feature will be Klaus Wenninger's new implementation
>> of event-driven alerts -- the ability to call scripts whenever
>> interesting events occur (nodes joining/leaving, resources
>> starting/stopping, etc.).
> 
> Ooh, that sounds cool!  Can it call scripts after fencing has
> completed?  And how is it determined which node the script runs on,
> and can that be limited via constraints or similar?

Yes, it is called after all "interesting" events (including fencing), and
the script can use the provided environment variables to determine what
type of event it was.
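
As a rough illustration of an alert script telling event types apart, here
is a minimal Python sketch. The variable names (CRM_alert_kind,
CRM_alert_node, CRM_alert_rsc, CRM_alert_task) are assumptions not stated in
this thread, so check them against the documentation for your release. The
function classify_event is a hypothetical helper, not part of Pacemaker.

```python
import os


def classify_event(env):
    """Describe an event from the alert environment variables.

    The variable names are assumptions; verify them against the
    Pacemaker documentation for the release in use.
    """
    kind = env.get("CRM_alert_kind", "")
    node = env.get("CRM_alert_node", "unknown")
    if kind == "fencing":
        return f"fencing event for node {node}"
    if kind == "node":
        return f"node membership change for {node}"
    if kind == "resource":
        rsc = env.get("CRM_alert_rsc", "unknown")
        task = env.get("CRM_alert_task", "unknown")
        return f"resource {rsc} ran {task} on {node}"
    return "unrecognized event"


if __name__ == "__main__":
    # When run by the cluster, the event details arrive in the
    # process environment rather than as arguments.
    print(classify_event(os.environ))
```

Taking the environment as a plain mapping keeps the logic testable outside
the cluster: pass a dict in place of os.environ.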

We don't notify before events, because at that moment we don't know
whether the event will really happen or not. We might try but fail.

> I'm wondering if it could replace the current fencing_topology hack we
> use to invoke fence_compute which starts the workflow for recovering
> VMs off dead OpenStack nova-compute nodes.

Yes, that is one of the reasons we did this!

The initial implementation only allowed for one script to be called (the
"notification-agent" property), but we quickly found out that someone
might need to email an administrator, notify nova-compute, and do other
types of handling as well. Making someone write one script that did
everything would be too complicated and error-prone (and unsupportable).
So we abandoned "notification-agent" and went with this new approach.

Coordinate with Andrew Beekhof for the nova-compute alert script, as he
already has some ideas for that.

> Although even if that's possible, maybe there are good reasons to stay
> with the fencing_topology approach?
> 
> Within the same OpenStack compute node HA scenario, it strikes me that
> this could be used to invoke "nova service-disable" when the
> nova-compute service crashes on a compute node and then fails to
> restart.  This would eliminate the window in between the crash and the
> nova server timing out the nova-compute service - during which it
> would otherwise be possible for nova-scheduler to attempt to schedule
> new VMs on the compute node with the crashed nova-compute service.
> 
> IIUC, this is one area where masakari is currently more sophisticated
> than the approach based on OCF RAs:
> 
> https://github.com/ntt-sic/masakari/blob/master/docs/evacuation_patterns.md#evacuation-patterns
> 
> Does that make sense?

Maybe. The script would need to be able to determine based on the
provided environment variables whether it's in that situation or not.
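
One way such a check might look, purely as a sketch: treat a resource event
for the nova-compute resource whose start operation returned a non-zero code
as the "failed to restart" case. The variable names (CRM_alert_kind,
CRM_alert_rsc, CRM_alert_task, CRM_alert_rc) and the resource name
"nova-compute" are assumptions; a real deployment may name the resource
differently (for example, as a clone instance), and this does not invoke any
nova command itself.

```python
def is_failed_start(env, resource="nova-compute"):
    """Return True if the event looks like a failed start of `resource`.

    Assumes hypothetical variable names and that a return code of "0"
    means success; both should be verified before relying on this.
    """
    if env.get("CRM_alert_kind") != "resource":
        return False
    if env.get("CRM_alert_rsc") != resource:
        return False
    if env.get("CRM_alert_task") != "start":
        return False
    # Environment values are strings, so compare against "0".
    return env.get("CRM_alert_rc", "0") != "0"
```

A script could call this with os.environ and only then proceed to whatever
follow-up action is appropriate.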