[ClusterLabs] Coming in 1.1.15: Event-driven alerts

Mon May 2 18:05:51 EDT 2016

On 04/22/2016 05:55 PM, Adam Spiers wrote:
> Ken Gaillot <kgaillot at redhat.com> wrote:
>> On 04/21/2016 06:09 PM, Adam Spiers wrote:
>>> Ken Gaillot <kgaillot at redhat.com> wrote:
>>>> Hello everybody,
>>>>
>>>> The release cycle for 1.1.15 will be started soon (hopefully tomorrow)!
>>>>
>>>> The most prominent feature will be Klaus Wenninger's new implementation
>>>> of event-driven alerts -- the ability to call scripts whenever
>>>> interesting events occur (nodes joining/leaving, resources
>>>> starting/stopping, etc.).
>>>
>>> Ooh, that sounds cool!  Can it call scripts after fencing has
>>> completed?  And how is it determined which node the script runs on,
>>> and can that be limited via constraints or similar?
>>
>> Yes, it called after all "interesting" events (including fencing), and
>> the script can use the provided environment variables to determine what
>> type of event it was.
> 
> Great.  Does the script run on the DC, or is that configurable somehow?

The script runs on all cluster nodes, to give maximum flexibility and
resiliency (during partitions etc.). Scripts must handle ordering and
de-duplication themselves, if needed.

A script that isn't too concerned about partitions might simply check
whether the local node is the DC, and only take action if so, to avoid
duplicates.

We're definitely interested in hearing how people approach these issues.
The possibilities for what an alert script might do are wide open, and
we want to be as flexible as possible at this stage. If the community
settles on certain approaches or finds certain gaps, we can enhance the
support in those areas as needed.

>> We don't notify before events, because at that moment we don't know
>> whether the event will really happen or not. We might try but fail.
> 
> You lost me here ;-)

We only call alert scripts after an event occurs, because we can't
predict the future. :-) For example, we don't know whether a node is
about to join or leave the cluster. Or for fencing, we might try to
fence but be unsuccessful -- and the part of pacemaker that calls the
alert scripts won't even know about fencing initiated outside cluster
control, such as by DLM or a human running stonith_admin.

>>> I'm wondering if it could replace the current fencing_topology hack we
>>> use to invoke fence_compute which starts the workflow for recovering
>>> VMs off dead OpenStack nova-compute nodes.
>>
>> Yes, that is one of the reasons we did this!
> 
> Haha, at this point can I say great minds think alike? ;-)
> 
>> The initial implementation only allowed for one script to be called (the
>> "notification-agent" property), but we quickly found out that someone
>> might need to email an administrator, notify nova-compute, and do other
>> types of handling as well. Making someone write one script that did
>> everything would be too complicated and error-prone (and unsupportable).
>> So we abandoned "notification-agent" and went with this new approach.
>>
>> Coordinate with Andrew Beekhof for the nova-compute alert script, as he
>> already has some ideas for that.
> 
> OK.  I'm sure we'll be able to talk about this more next week in Austin!
> 
>>> Although even if that's possible, maybe there are good reasons to stay
>>> with the fencing_topology approach?
>>>
>>> Within the same OpenStack compute node HA scenario, it strikes me that
>>> this could be used to invoke "nova service-disable" when the
>>> nova-compute service crashes on a compute node and then fails to
>>> restart.  This would eliminate the window in between the crash and the
>>> nova server timing out the nova-compute service - during which it
>>> would otherwise be possible for nova-scheduler to attempt to schedule
>>> new VMs on the compute node with the crashed nova-compute service.
>>>
>>> IIUC, this is one area where masakari is currently more sophisticated
>>> than the approach based on OCF RAs:
>>>
>>> https://github.com/ntt-sic/masakari/blob/master/docs/evacuation_patterns.md#evacuation-patterns
>>>
>>> Does that make sense?
>>
>> Maybe. The script would need to be able to determine based on the
>> provided environment variables whether it's in that situation or not.
> 
> Yep.