[ClusterLabs] Regular pengine warnings after a transient failure
Ferenc Wágner
wferi at niif.hu
Mon Mar 7 15:03:18 EST 2016
Ken Gaillot <kgaillot at redhat.com> writes:
> On 03/07/2016 07:31 AM, Ferenc Wágner wrote:
>
>> 12:55:13 vhbl07 crmd[8484]: notice: Transition aborted by vm-eiffel_monitor_60000 'create' on vhbl05: Foreign event (magic=0:0;521:0:0:634eef05-39c1-4093-94d4-8d624b423bb7, cib=0.613.98, source=process_graph_event:600, 0)
>
> That means the action was initiated by a different node (the previous DC
> presumably), so the new DC wants to recalculate everything.
Time travel was sort of possible in that situation, and recurring
monitor operations are not logged, so this is indeed plausible. The
main thing is that it wasn't mishandled.
>> recovery actions turned into start actions for the resources stopped
>> during the previous transition. However, almost all other recovery
>> actions just disappeared without any comment. This was actually
>> correct, but I really wonder why the cluster decided to paper over
>> the previous monitor operation timeouts. Maybe the operations
>> finished in the meantime and got accounted for somehow, just not logged?
>
> I'm not sure why the PE decided recovery was not necessary. Operation
> results wouldn't be accepted without being logged.
At which logging level? I can't see recurring monitor operation logs
either in syslog (at the default logging level: notice) or in
/var/log/pacemaker.log (which contains info-level messages as well).
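(For next time, I suppose the detail could be raised via the sysconfig
knobs of this Pacemaker generation, /etc/default/pacemaker on Debian or
/etc/sysconfig/pacemaker on RHEL; variable names from memory:

    # Per-daemon debug logging; any comma-separated daemon list or "yes"
    # should work here, "crmd,pengine" is just an illustration.
    PCMK_debug=crmd,pengine
    # The detailed log file (info level, plus debug with the above):
    PCMK_logfile=/var/log/pacemaker.log

though that obviously only helps with future incidents.)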
However, the info-level logs contain more "Transition aborted" lines,
as if only the first of them were logged at notice level. This would
make sense, since the later ones make no difference to an already
aborted transition, so they aren't that important. And indeed such
lines were suppressed from the syslog I checked first, for example:
12:55:39 [8479] vhbl07 cib: info: cib_perform_op: Diff: --- 0.613.120 2
12:55:39 [8479] vhbl07 cib: info: cib_perform_op: Diff: +++ 0.613.121 (null)
12:55:39 [8479] vhbl07 cib: info: cib_perform_op: + /cib: @num_updates=121
12:55:39 [8479] vhbl07 cib: info: cib_perform_op: ++ /cib/status/node_state[@id='167773707']/lrm[@id='167773707']/lrm_resources/lrm_resource[@id='vm-elm']: <lrm_rsc_op id="vm-elm_monitor_60000" operation_key="vm-elm_monitor_60000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" transition-key="473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" transition-magic="0:0;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" on_node="vhbl05" call-id="645" rc-code="0" op-st
12:55:39 [8479] vhbl07 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=vhbl05/crmd/362, version=0.613.121)
12:55:39 [8484] vhbl07 crmd: info: abort_transition_graph: Transition aborted by vm-elm_monitor_60000 'create' on vhbl05: Foreign event (magic=0:0;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7, cib=0.613.121, source=process_graph_event:600, 0)
12:55:39 [8484] vhbl07 crmd: info: process_graph_event: Detected action (0.473) vm-elm_monitor_60000.645=ok: initiated by a different node
I can very much imagine this cancelling the FAILED state induced by a
monitor timeout like:
12:54:52 [8479] vhbl07 cib: info: cib_perform_op: ++ <lrm_resource id="vm-elm" type="TransientDomain" class="ocf" provider="niif">
12:54:52 [8479] vhbl07 cib: info: cib_perform_op: ++ <lrm_rsc_op id="vm-elm_last_failure_0" operation_key="vm-elm_monitor_60000" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10" transition-key="473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" transition-magic="2:1;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" on_node="vhbl05" call-id="645" rc-code="1" op-status="2" interval="60000" last-rc-change="1456833279" exe
12:54:52 [8479] vhbl07 cib: info: cib_perform_op: ++ <lrm_rsc_op id="vm-elm_last_0" operation_key="vm-elm_start_0" operation="start" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10" transition-key="472:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" transition-magic="0:0;472:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" on_node="vhbl05" call-id="602" rc-code="0" op-status="0" interval="0" last-run="1456091121" last-rc-change="1456091121" e
12:54:52 [8479] vhbl07 cib: info: cib_perform_op: ++ </lrm_resource>
The transition-keys match; does this mean that the above is a late
result from the monitor operation which was previously considered timed
out? How did it reach vhbl07, if the DC at that time was vhbl03?
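(The way I decode these magic strings, assuming the usual
op-status:rc;action:transition:target-rc:dc-uuid layout:

    # bash sketch, splitting the failure magic quoted above
    magic='2:1;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7'
    IFS=':;' read -r op_status rc action transition target_rc dc_uuid <<< "$magic"
    # op-status 2 = timed out, rc 1 = error; the later 0:0 = done, ok
    echo "status=$op_status rc=$rc action=$action of transition $transition"

that is, both records refer to action 473 of transition 0, only the
status/rc prefix differs.)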
> The pe-input files from the transitions around here should help.
They are available. What shall I look for?
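I guess I could start by replaying them, something like:

    # Replay a saved policy engine input and show the resulting actions
    # (the path and the file number are placeholders):
    crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-123.bz2
    # Adding -s shows the allocation scores behind the decisions:
    crm_simulate -Ss -x /var/lib/pacemaker/pengine/pe-input-123.bz2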
>> Basically, the cluster responded beyond my expectations, sparing lots of
>> unnecessary recoveries or fencing. I'm happy, thanks for this wonderful
>> software! But I'm left with these "Processing failed op monitor"
>> warnings emitted every 15 minutes (timer pops). Is it safe and clever
>> to cleanup the affected resources? Would that get rid of them without
>> invoking other effects, like recoveries for example?
>
> That's normal; it's how the cluster maintains the effect of a failure
> that has not been cleared. The logs can be confusing, because it's not
> apparent from that message alone whether the failure is new or old.
Ah, do you mean that these are the same things that appear after
"Failed Actions:" at the end of the crm_mon output? They certainly
match, and the logs are indeed confusing.
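(For completeness, I'm checking them with something like this; flags
from memory:

    # One-shot cluster status including inactive resources and fail counts:
    crm_mon -1rf
    # Fail count of a single resource on a given node, for example:
    crm_failcount -G -r vm-elm -N vhbl05

)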
> Cleaning up the resource will end the failure condition, so what happens
> next depends on the configuration and state of the cluster. If the
> failure was preventing a preferred node from running the resource, the
> resource could move, depending on other factors such as stickiness.
These resources are (still) running fine, suffered only monitor
failures, and are node-neutral, so it should be safe to clean them up,
I suppose.
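I'd go with something like:

    # Clear the failure record of one affected resource (example name);
    # this wipes its operation history, so the cluster re-probes it,
    # but it should not restart a resource that is running fine.
    crm_resource --cleanup --resource vm-elm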
--
Thanks for your quick and enlightening answer! I feared the mere length
of my message would scare everybody away...
Regards,
Feri