[Pacemaker] RFC: Better error reporting for RAs.

Mon Aug 3 05:27:17 EDT 2009

On 2009-08-02T11:14:35, Andrew Beekhof <andrew at beekhof.net> wrote:

>> The transition graph already includes a unique identifier for each
>> action. If this was made, maybe, a bit shorter, and provided to the RA
>> as part of the environment, the RA could include this as part of each
>> log message - and if then this was also included in the CIB,
>> crm_mon/pengine could provide the key which users could feed to grep and
>> much more quickly find out what exactly has been going wrong.
> That would be transition_key, which tells you which crmd instance, graph,
> action number, and expected result every action has.
> Just log it at the various places you want.

Doesn't get passed to the RA though, or am I missing something? It's not
in the environment.
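
To make the idea concrete, here is a rough sketch of what the RA side
could look like if the key were passed in. The variable name is purely
hypothetical (as said, nothing like it is in the environment today), and
the key value in the usage note is made up:

```python
import os

# Hypothetical variable name, for illustration only; no such variable
# is currently passed to the RA.
KEY_ENV = "OCF_RESKEY_CRM_meta_transition_key"


def tag_message(message, environ=None):
    """Prefix a log message with the action key, if one was passed in."""
    env = os.environ if environ is None else environ
    key = env.get(KEY_ENV)
    if not key:
        # No key available: log the message unchanged.
        return message
    return "[%s] %s" % (key, message)
```

With the key set to "example-key-42", tag_message("config file missing")
would yield "[example-key-42] config file missing", which is the string a
user could then feed to grep.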

> Though I don't see the point, grepping for the resource id is usually just 
> as effective.

The problem is that it isn't. That shows all the PE messages and too
much of the TE output, and the lines which tend to be extremely obvious
- because they are actually tagged with "ERROR:" and repeat several
times - are the PE ones, which confuses users.

We need a better way to backtrack from the log message to the actual
invocation which caused the error to be recorded.

Actually the transition key _is_ already in the CIB and at least the
start is recorded in the logs (but not the completion, which would
presumably be cheap to add) - except that it is a bit longish. Would you
object to logging the completion event on the node where it was run too,
and possibly including it in crm_mon/pengine logs when they log an ERROR
during unpack_rsc?

(I'd suggest the lrm_rsc_op id attribute, but that is not unique within
the cluster, neither on the time nor node axis.)

>> 2. Verbose error reporting
>>
>> The PE et al only care about and interpret the exit code. While the exit code
>> is differentiated enough to categorize the error and allows the cluster
>> to figure out how to respond, it is not sufficient for users to figure
>> out what is wrong. Case in point: "not installed" - what, exactly, is
>> not installed?
>
> Entirely dependent on the RA as you well know.

Exactly, and that is why better/more verbose reporting would be
welcome - right now, all the RA can do is log, which sucks for users,
because they get lost in the multitude of logs we spew.

>> A possible thought would be for the RA to print a one-line summary to
>> stderr, and record this in the CIB along with the machine-readable
>> encoded error. This would only be used for reporting to users.
> No.
> We already log error output when an action fails.

That is not helpful enough for users. If you doubt that, read some bug
reports ;-)

> Again, easily found by grepping for the resource ID.

Not entirely true; the resource id is abundant in the logs. I want to be
able to easily identify the section related to the particular, specific
invocation.
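
A toy illustration of the difference, as a sketch only. The log lines
below are invented for the example (not real Pacemaker output), and the
key "example-key-42" is a made-up placeholder:

```python
def lines_matching(log_lines, needle):
    """Return the log lines that contain the given string."""
    return [line for line in log_lines if needle in line]


# Invented sample lines; the resource is called "db" here.
sample_log = [
    "pengine: ERROR: unpack_rsc_op: Hard error - db_start_0 failed",
    "crmd: info: te_rsc_command: Initiating action 7: db_start_0",
    "lrmd: info: RA output: (db:start) [example-key-42] config file missing",
    "pengine: ERROR: unpack_rsc_op: Hard error - db_start_0 failed",
]

by_resource = lines_matching(sample_log, "db")
by_key = lines_matching(sample_log, "example-key-42")
```

Here by_resource holds all four lines, including the repeated PE errors,
while by_key holds only the single line written by the invocation in
question.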

> I'd suggest focusing on improving the error logging that most RAs have 
> rather than adding yet more mechanisms for achieving the same thing.

It is not the same thing. It would allow crm_mon or the GUI to display
something more verbose and thus useful to users, and reduce the
workload for the poor souls having to analyse the bug reports.

I agree it goes beyond the first proposal, which is why I split it into
two sections. But no doubt it would improve users' ability to
maintain their clusters.

And yes, of course this only makes sense if RAs improve their error
logging at the same time (otherwise, there's no "one line summary" to
record).
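
For illustration, the capturing side could be about this small. This is
a sketch under assumptions, not how the lrmd does it: run_action is a
hypothetical helper, the "agent" is a stand-in child process, and the
path in its message is made up. It exits with 5 (OCF_ERR_INSTALLED) and
prints one line to stderr:

```python
import subprocess
import sys


def run_action(argv):
    """Run an agent action; return (exit code, first line of stderr)."""
    proc = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    lines = proc.stderr.splitlines()
    # Keep only the first line as the human-readable summary.
    summary = lines[0].strip() if lines else ""
    return proc.returncode, summary


# Stand-in for a real agent: complains on stderr, exits "not installed".
fake_agent = [
    sys.executable,
    "-c",
    "import sys; "
    "sys.stderr.write('binary /usr/sbin/exampled not found\\n'); "
    "sys.exit(5)",
]

rc, summary = run_action(fake_agent)
```

The exit code stays the only thing the cluster acts on; the summary
would merely be stored next to it for crm_mon or the GUI to show.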

Regards,
    Lars

-- 
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde