[ClusterLabs] [OCF] Overhaul of the OCF resource agent api spec

Lars Ellenberg lars+ocf at linbit.com
Tue Feb 10 17:33:51 EST 2015


On Tue, Feb 10, 2015 at 02:45:06PM -0700, Alan Robertson wrote:
> On 02/10/2015 02:09 PM, Lars Ellenberg wrote:
> > On Tue, Feb 10, 2015 at 09:44:40PM +0100, Lars Ellenberg wrote:
> >> > Then we take it from there,
> >> > and do the necessary overhaul of this OCF RA API spec.
> >> > 
> >> > I will followup with a list of items that need to be addressed
> >> > (as I remember them from the discussions we had in Brno).
> > * reserve new exit codes for a probe/monitor action
> >
> >   "running (Started/Slave), but degraded"
> >   "running (Master), but degraded"

> Conventional monitoring systems also provide statuses which indicate a
> marginal condition - "working, but barely" kind of thing.

Can you give an example of something that is
  working properly
  working "degraded"
  working "barely"

Thing is: there is usually nothing pacemaker can do about this
(other than record that status in the CIB, and thus make it digestible
by crm_mon, and all sorts of UIs).
Which means we do not really benefit from that finer distinction.

What we want to achieve by introducing these additional exit codes
is that an operator who only occasionally checks crm_mon output
or equivalent, and has no proper alerting via additional tactical
monitoring, will not be misled by a resource state of "Running".

Example:
A DRBD Primary lost its disk for some reason.
Right now it would still show up as "Running Master" in crm_mon.
Two days later, the network (or the peer) has some hiccup,
and the resource, and everything depending on it, fails.

Had the operator seen "DEGRADED" in crm_mon,
he might have taken action two days earlier.

I don't see how an additional "DEGRADED HELP ME URGENTLY"
would improve the situation further.
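
To make the exit-code variant concrete, here is a minimal sketch (Python,
for illustration only) of how a status display could map monitor exit
codes to what the operator sees. The values 0, 7, 8 and 9 are the existing
OCF codes; the two "degraded" values and all display strings are
placeholders assumed for this sketch, not something agreed in this thread.

```python
# Existing OCF exit codes relevant to a probe/monitor action.
OCF_SUCCESS = 0
OCF_NOT_RUNNING = 7
OCF_RUNNING_MASTER = 8
OCF_FAILED_MASTER = 9

# Placeholder values for the proposed new codes (assumption, not from the spec
# as discussed here).
OCF_DEGRADED = 190
OCF_DEGRADED_MASTER = 191

# Illustrative display strings; the real crm_mon wording may differ.
DISPLAY = {
    OCF_SUCCESS: "Running",
    OCF_NOT_RUNNING: "Stopped",
    OCF_RUNNING_MASTER: "Running Master",
    OCF_FAILED_MASTER: "FAILED Master",
    OCF_DEGRADED: "Running (DEGRADED)",
    OCF_DEGRADED_MASTER: "Running Master (DEGRADED)",
}


def display_state(rc):
    """Map a monitor exit code to the text shown to the operator."""
    return DISPLAY.get(rc, "FAILED (rc=%d)" % rc)
```

With this, the DRBD Primary that lost its disk would return the degraded
master code and be displayed with a visible DEGRADED marker, while the
cluster still treats the resource as running.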

We can already "enrich" the feedback from the resource agent via free
form text messages, so we could have more than the exit code alone.

Which in fact means that we could also just forget about the additional
exit codes, and instead specify that some not-so-free form text message
would be recognized as "internal health state" of the resource.
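
A minimal sketch of that alternative (Python, for illustration only):
the agent keeps returning the usual exit code, and the consumer looks for
one structured line in the otherwise free-form message. The line format
"health: <state>; <detail>" and the function name parse_health are
hypothetical, invented for this sketch; nothing like it is specified yet.

```python
import re

# Hypothetical "not-so-free form" line: "health: ok" or
# "health: degraded; <detail text>". The format is an assumption.
HEALTH_RE = re.compile(
    r"^health:\s*(ok|degraded)\b(?:\s*;\s*(.*))?$", re.IGNORECASE
)


def parse_health(message):
    """Return (state, detail) from the first matching line.

    Falls back to ("unknown", "") when no line follows the format, so
    plain free-form messages keep working unchanged.
    """
    for line in message.splitlines():
        match = HEALTH_RE.match(line.strip())
        if match:
            state = match.group(1).lower()
            detail = (match.group(2) or "").strip()
            return state, detail
    return "unknown", ""
```

For example, parse_health("health: degraded; local disk detached") yields
("degraded", "local disk detached"), and a message without such a line
yields ("unknown", ""), so older agents are not misread as degraded.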

	Lars



