[ClusterLabs] Regular pengine warnings after a transient failure

Mon Mar 7 18:36:56 EST 2016

On 03/07/2016 02:03 PM, Ferenc Wágner wrote:
> Ken Gaillot <kgaillot at redhat.com> writes:
> 
>> On 03/07/2016 07:31 AM, Ferenc Wágner wrote:
>>
>>> 12:55:13 vhbl07 crmd[8484]: notice: Transition aborted by vm-eiffel_monitor_60000 'create' on vhbl05: Foreign event (magic=0:0;521:0:0:634eef05-39c1-4093-94d4-8d624b423bb7, cib=0.613.98, source=process_graph_event:600, 0)
>>
>> That means the action was initiated by a different node (the previous DC
>> presumably), so the new DC wants to recalculate everything.
> 
> Time travel was sort of possible in that situation, and recurring
> monitor operations are not logged, so this is indeed possible.  The main
> thing is that it wasn't mishandled.
> 
>>> recovery actions turned into start actions for the resources stopped
>>> during the previous transition.  However, almost all other recovery
>>> actions just disappeared without any comment.  This was actually
>>> correct, but I really wonder why the cluster decided to paper over
>>> the previous monitor operation timeouts.  Maybe the operations
>>> finished meanwhile and got accounted somehow, just not logged?
>>
>> I'm not sure why the PE decided recovery was not necessary. Operation
>> results wouldn't be accepted without being logged.
> 
> At which logging level?  I can't see recurring monitor operation logs in
> syslog (at default logging level: notice) nor in /var/log/pacemaker.log
> (which contains info level messages as well).
> 
> However, the info level logs contain more "Transition aborted" lines, as
> if only the first of them got logged with notice level.  This would make
> sense, since the later ones don't make any difference on an already
> aborted transition, so they aren't that important.  And in fact such
> lines were suppressed from the syslog I checked first, for example:
> 
> 12:55:39 [8479] vhbl07        cib:     info: cib_perform_op:     Diff: --- 0.613.120 2
> 12:55:39 [8479] vhbl07        cib:     info: cib_perform_op:     Diff: +++ 0.613.121 (null)
> 12:55:39 [8479] vhbl07        cib:     info: cib_perform_op:     +  /cib:  @num_updates=121
> 12:55:39 [8479] vhbl07        cib:     info: cib_perform_op:     ++ /cib/status/node_state[@id='167773707']/lrm[@id='167773707']/lrm_resources/lrm_resource[@id='vm-elm']:  <lrm_rsc_op id="vm-elm_monitor_60000" operation_key="vm-elm_monitor_60000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" transition-key="473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" transition-magic="0:0;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" on_node="vhbl05" call-id="645" rc-code="0" op-st
> 12:55:39 [8479] vhbl07        cib:     info: cib_process_request:        Completed cib_modify operation for section status: OK (rc=0, origin=vhbl05/crmd/362, version=0.613.121)
> 12:55:39 [8484] vhbl07       crmd:     info: abort_transition_graph:     Transition aborted by vm-elm_monitor_60000 'create' on vhbl05: Foreign event (magic=0:0;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7, cib=0.613.121, source=process_graph_event:600, 0)
> 12:55:39 [8484] vhbl07       crmd:     info: process_graph_event:        Detected action (0.473) vm-elm_monitor_60000.645=ok: initiated by a different node
> 
> I can very much imagine this cancelling the FAILED state induced by a
> monitor timeout like:
> 
> 12:54:52 [8479] vhbl07        cib:     info: cib_perform_op:     ++                                               <lrm_resource id="vm-elm" type="TransientDomain" class="ocf" provider="niif">
> 12:54:52 [8479] vhbl07        cib:     info: cib_perform_op:     ++                                                 <lrm_rsc_op id="vm-elm_last_failure_0" operation_key="vm-elm_monitor_60000" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10" transition-key="473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" transition-magic="2:1;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" on_node="vhbl05" call-id="645" rc-code="1" op-status="2" interval="60000" last-rc-change="1456833279" exe
> 12:54:52 [8479] vhbl07        cib:     info: cib_perform_op:     ++                                                 <lrm_rsc_op id="vm-elm_last_0" operation_key="vm-elm_start_0" operation="start" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10" transition-key="472:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" transition-magic="0:0;472:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" on_node="vhbl05" call-id="602" rc-code="0" op-status="0" interval="0" last-run="1456091121" last-rc-change="1456091121" e
> 12:54:52 [8479] vhbl07        cib:     info: cib_perform_op:     ++                                               </lrm_resource>
> 
> The transition-keys match, does this mean that the above is a late
> result from the monitor operation which was considered timed-out
> previously?  How did it reach vhbl07, if the DC at that time was vhbl03?
> 
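To compare the two records field by field, it can help to split the
transition-magic string apart. The sketch below is only an illustration:
the field order (op-status:rc;action:transition:target-rc:uuid) is an
assumption inferred from the logs quoted above, where magic "2:1;473:0:0:..."
sits next to op-status="2" rc-code="1" and the crmd line reports the
action as (0.473). The names TransitionMagic and parse_magic are made up
for this example and are not part of Pacemaker.

```python
from typing import NamedTuple


class TransitionMagic(NamedTuple):
    """Decoded transition-magic value (field order is an assumption)."""
    op_status: int      # execution status of the operation
    rc: int             # return code reported by the resource agent
    action_id: int      # action number within the transition graph
    transition_id: int  # transition counter
    target_rc: int      # return code the scheduler expected
    uuid: str           # identifier of the instance that scheduled the action


def parse_magic(magic):
    """Split 'status:rc;action:transition:target_rc:uuid' into named fields."""
    status_part, key = magic.split(";", 1)
    op_status, rc = (int(field) for field in status_part.split(":"))
    action_id, transition_id, target_rc, uuid = key.split(":", 3)
    return TransitionMagic(op_status, rc, int(action_id),
                           int(transition_id), int(target_rc), uuid)


def same_action(magic_a, magic_b):
    """True if both magic strings carry the same transition key."""
    return parse_magic(magic_a)[2:] == parse_magic(magic_b)[2:]
```

Feeding it the vm-elm_last_failure_0 magic and the later
vm-elm_monitor_60000 magic shows whether only the status and return
code differ while the key (473:0:0:634eef05-...) stays the same.
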
>> The pe-input files from the transitions around here should help.
> 
> They are available.  What shall I look for?

It's not the most user-friendly of tools, but crm_simulate can show how
the cluster would react to each transition: crm_simulate -Sx $FILE.bz2

Adding -D $FILE.dot will output to a dot file, then
dot $FILE.dot -Tpng > $FILE.png will produce a graphic of the transition,
which can be interpreted the same way as
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#s-config-testing-changes
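
If there are many pe-input files to go through, the two commands above
can be generated in a loop. This is a minimal sketch, not a supported
tool: the function names are invented for the example, the
"pe-input-*.bz2" file name pattern is an assumption about how the files
are named on your nodes, and "dot -o" is used in place of the shell
redirection. Nothing is executed unless run=True is passed.

```python
import subprocess
from pathlib import Path


def simulate_commands(pe_input):
    """Return the (crm_simulate, dot) argument lists for one pe-input file."""
    pe_input = Path(pe_input)
    # pe-input-42.bz2 -> pe-input-42 (strip only a trailing .bz2)
    stem = pe_input.with_suffix("") if pe_input.suffix == ".bz2" else pe_input
    dot_file = str(stem) + ".dot"
    png_file = str(stem) + ".png"
    simulate = ["crm_simulate", "-S", "-x", str(pe_input), "-D", dot_file]
    # "dot -Tpng in.dot -o out.png" writes the same file as "> out.png"
    render = ["dot", "-Tpng", dot_file, "-o", png_file]
    return simulate, render


def replay(directory, run=False):
    """Print, and optionally run, the commands for each pe-input file."""
    # Assumed naming pattern; files are visited in plain lexical order.
    for pe_input in sorted(Path(directory).glob("pe-input-*.bz2")):
        for cmd in simulate_commands(pe_input):
            print(" ".join(cmd))
            if run:
                subprocess.run(cmd, check=True)
```

Calling replay() on the directory holding the saved pe-input files first
prints what would be run; adding run=True executes crm_simulate and dot
for each file in turn.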

>>> Basically, the cluster responded beyond my expectations, sparing lots of
>>> unnecessary recoveries or fencing.  I'm happy, thanks for this wonderful
>>> software!  But I'm left with these "Processing failed op monitor"
>>> warnings emitted every 15 minutes (timer pops).  Is it safe and clever
>>> to cleanup the affected resources?  Would that get rid of them without
>>> invoking other effects, like recoveries for example?
>>
>> That's normal; it's how the cluster maintains the effect of a failure
>> that has not been cleared. The logs can be confusing, because it's not
>> apparent from that message alone whether the failure is new or old.
> 
> Ah, do you mean that these are the same thing that appears after "Failed
> Actions:" at the end of the crm_mon output?  They certainly match, and
> the logs are confusing indeed.

Exactly.

>> Cleaning up the resource will end the failure condition, so what happens
>> next depends on the configuration and state of the cluster. If the
>> failure was preventing a preferred node from running the resource, the
>> resource could move, depending on other factors such as stickiness.
> 
> These resources are (still) running fine, suffered only monitor failures
> and are node-neutral, so it should be safe to cleanup them, I suppose.

Most likely.