[ClusterLabs] [OCF] Overhaul of the OCF resource agent api spec

Thu Feb 12 10:14:10 EST 2015

> On 11 Feb 2015, at 14:32, Alan Robertson <alanr at unix.sh> wrote:
> 
>> On 02/10/2015 03:33 PM, Lars Ellenberg wrote:
>>> On Tue, Feb 10, 2015 at 02:45:06PM -0700, Alan Robertson wrote:
>>>> On 02/10/2015 02:09 PM, Lars Ellenberg wrote:
>>>> On Tue, Feb 10, 2015 at 09:44:40PM +0100, Lars Ellenberg wrote:
>>>>>> Then we take it from there,
>>>>>> and do the necessary overhaul of this OCF RA API spec.
>>>>>> 
>>>>>> I will follow up with a list of items that need to be addressed
>>>>>> (as I remember them from the discussions we had in Brno).
>>>> * reserve new exit codes for a probe/monitor action
>>>> 
>>>>  "running (Started/Slave), but degraded"
>>>>  "running (Master), but degraded"
>>> Conventional monitoring systems also provide statuses which indicate a
>>> marginal condition - "working, but barely" kind of thing.
>> Can you give an example of something that is
>>  working properly
>>  working "degraded"
>>  working "barely"
>> 
>> Thing is: there is usually nothing pacemaker can do about this
>> (but to record that status in the CIB, and thus make it digestible
>> by crm_mon, and all sorts of UIs).
>> Which means we do not really benefit from that distinction.
>> 
>> What we want to achieve by introducing these additional exit codes
>> is that an operator who only occasionally checks crm_mon output
>> or equivalent, and has no proper alerting via additional tactical
>> monitoring, will not be misled by a resource state of "Running".
>> 
>> Example:
>> A DRBD Primary lost its disk for some reason.
>> Right now it would still show up as "Running Master" in crm_mon.
>> Two days later, the network (or the peer) has some hiccup,
>> and the resource, and everything depending on it, fails.
>> 
>> Had the operator seen "DEGRADED" in crm_mon,
>> he might have taken action two days earlier.
>> 
>> I don't see how an additional "DEGRADED HELP ME URGENTLY"
>> would improve the situation further.
>> 
>> We can already "enrich" the feedback from the resource agent via free
>> form text messages, so we could have more than the exit code alone.
>> 
>> Which in fact means that we could also just forget about the additional
>> exit codes, and instead specify that some not-so-free form text message
>> would be recognized as "internal health state" of the resource.
> 
> I'm not using the OCF RA with Pacemaker.  I use it for alerting in
> Assimilation.  Free-form would work, but then it couldn't be free form -
> it would have to have some structure to it ;-).  Certainly exit codes
> would be useful here to allow the free form text to be free form.
> 
> If you made the "free form" text to be JSON, then you could eliminate
> exit codes altogether - but I think the strict structure of exit codes
> serves a purpose of making it easier to decipher the meaning of what was
> observed.

Exactly. Exit codes are authoritative; the text is optional and purely
informational, for the benefit of the end user.
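
To illustrate that split, here is a minimal sketch in Python. It is not taken
from any existing agent or from the spec. The values 0, 7 and 8 are the
existing OCF codes for success, not running, and running master; the two
DEGRADED values are placeholders assumed for this example, since no numbers
have been assigned in this thread. The names OcfExit, MonitorResult, monitor
and display_state are hypothetical as well.

```python
from enum import IntEnum
from typing import NamedTuple, Optional


class OcfExit(IntEnum):
    # Existing OCF exit codes.
    SUCCESS = 0
    NOT_RUNNING = 7
    RUNNING_MASTER = 8
    # Placeholder values for the proposed "running, but degraded" codes.
    # These numbers are an assumption made for this sketch only.
    DEGRADED = 190
    DEGRADED_MASTER = 191


class MonitorResult(NamedTuple):
    code: OcfExit               # authoritative: consumers act on this alone
    text: Optional[str] = None  # optional, informational, shown to humans


def monitor(is_master: bool, disk_ok: bool) -> MonitorResult:
    """Toy monitor for the DRBD example: a running resource that lost its disk."""
    if disk_ok:
        return MonitorResult(OcfExit.RUNNING_MASTER if is_master else OcfExit.SUCCESS)
    code = OcfExit.DEGRADED_MASTER if is_master else OcfExit.DEGRADED
    return MonitorResult(code, "backing disk lost, running diskless")


# What a status display might show for each code (labels are illustrative).
STATE_LABELS = {
    OcfExit.SUCCESS: "Started",
    OcfExit.NOT_RUNNING: "Stopped",
    OcfExit.RUNNING_MASTER: "Master",
    OcfExit.DEGRADED: "Started DEGRADED",
    OcfExit.DEGRADED_MASTER: "Master DEGRADED",
}


def display_state(result: MonitorResult) -> str:
    """Derive the state from the exit code only; append the text if present."""
    state = STATE_LABELS[result.code]
    return f"{state} ({result.text})" if result.text else state


if __name__ == "__main__":
    print(display_state(monitor(is_master=True, disk_ok=False)))
```

The state shown to the operator depends only on the exit code, so the text
can stay truly free form and a consumer that ignores it loses nothing but
detail.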
