[ClusterLabs] [OCF] Overhaul of the OCF resource agent api spec

Wed Feb 11 08:32:39 EST 2015

On 02/10/2015 03:33 PM, Lars Ellenberg wrote:
> On Tue, Feb 10, 2015 at 02:45:06PM -0700, Alan Robertson wrote:
>> On 02/10/2015 02:09 PM, Lars Ellenberg wrote:
>>> On Tue, Feb 10, 2015 at 09:44:40PM +0100, Lars Ellenberg wrote:
>>>>> Then we take it from there,
>>>>> and do the necessary overhaul of this OCF RA API spec.
>>>>>
>>>>> I will follow up with a list of items that need to be addressed
>>>>> (as I remember them from the discussions we had in Brno).
>>> * reserve new exit codes for a probe/monitor action
>>>
>>>   "running (Started/Slave), but degraded"
>>>   "running (Master), but degraded"
>> Conventional monitoring systems also provide statuses which indicate a
>> marginal condition - "working, but barely" kind of thing.
> Can you give an example of something that is
>   working properly
>   working "degraded"
>   working "barely"
>
> Thing is: there is usually nothing pacemaker can do about this
> (but to record that status in the CIB, and thus make it digestible
> by crm_mon, and all sorts of UIs).
> Which means we do not really benefit from that distinction.
>
> What we want to achieve by introducing these additional exit codes
> is that an operator who only occasionally checks crm_mon output
> or equivalent, and has no proper alerting via additional tactical
> monitoring, will not be misled by a resource state of "Running".
>
> Example:
> A DRBD Primary lost its disk for some reason.
> Right now it would still show up as "Running Master" in crm_mon.
> Two days later, the network (or the peer) has some hiccup,
> and the resource, and everything depending on it, fails.
>
> Had the operator seen "DEGRADED" in crm_mon,
> he might have taken action two days earlier.
>
> I don't see how an additional "DEGRADED HELP ME URGENTLY"
> would improve the situation further.
>
> We can already "enrich" the feedback from the resource agent via free
> form text messages, so we could have more than the exit code alone.
>
> Which in fact means that we could also just forget about the additional
> exit codes, and instead specify that some not-so-free form text message
> would be recognized as "internal health state" of the resource.

I'm not using the OCF RA with Pacemaker.  I use it for alerting in
Assimilation.  Free-form would work, but then it couldn't be free form -
it would have to have some structure to it ;-).  Certainly exit codes
would be useful here to allow the free form text to be free form.

If you made the "free form" text JSON, then you could eliminate
exit codes altogether - but I think the strict structure of exit codes
serves the purpose of making it easier to decipher the meaning of what
was observed.
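
To make the split concrete, here is a rough sketch of how a consumer of
the agent's result might use both channels: the exit code carries the
coarse state, and the text stays free form, parsed as JSON only when it
happens to be JSON. The values 0, 7 and 8 are the existing OCF codes;
the two "degraded" numbers, the interpret() function and the JSON keys
are assumptions made up for illustration, since nothing in this thread
assigns them.

```python
import json

# Exit codes already defined by the OCF resource agent API.
OCF_SUCCESS = 0
OCF_NOT_RUNNING = 7
OCF_RUNNING_MASTER = 8

# HYPOTHETICAL numbers for the proposed "degraded" codes.
# Nothing in this discussion has assigned actual values.
OCF_DEGRADED = 190
OCF_DEGRADED_MASTER = 191

# exit code -> (role, degraded?)
STATES = {
    OCF_SUCCESS: ("running", False),
    OCF_RUNNING_MASTER: ("master", False),
    OCF_DEGRADED: ("running", True),
    OCF_DEGRADED_MASTER: ("master", True),
    OCF_NOT_RUNNING: ("stopped", False),
}


def interpret(exit_code, text=""):
    """Combine the exit code with the agent's text output.

    The exit code alone decides role and degraded flag. The text is
    decoded as JSON if possible, otherwise kept as a plain string.
    Any exit code not listed in STATES is treated as a failure.
    """
    role, degraded = STATES.get(exit_code, ("failed", False))
    try:
        detail = json.loads(text)
    except ValueError:
        detail = text.strip()
    return {"role": role, "degraded": degraded, "detail": detail}
```

With this shape, a DRBD Primary that lost its disk could report the
hypothetical degraded-master code plus whatever explanation it likes,
and an alerting tool never has to parse the text to know the state.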