[ClusterLabs] Regular pengine warnings after a transient failure
Ken Gaillot
kgaillot at redhat.com
Mon Mar 7 18:36:56 EST 2016
On 03/07/2016 02:03 PM, Ferenc Wágner wrote:
> Ken Gaillot <kgaillot at redhat.com> writes:
>
>> On 03/07/2016 07:31 AM, Ferenc Wágner wrote:
>>
>>> 12:55:13 vhbl07 crmd[8484]: notice: Transition aborted by vm-eiffel_monitor_60000 'create' on vhbl05: Foreign event (magic=0:0;521:0:0:634eef05-39c1-4093-94d4-8d624b423bb7, cib=0.613.98, source=process_graph_event:600, 0)
>>
>> That means the action was initiated by a different node (the previous DC
>> presumably), so the new DC wants to recalculate everything.
>
> Time travel was sort of possible in that situation, and recurring
> monitor operations are not logged, so this is indeed possible. The main
> thing is that it wasn't mishandled.
>
>>> recovery actions turned into start actions for the resources stopped
>>> during the previous transition. However, almost all other recovery
>>> actions just disappeared without any comment. This was actually
>>> correct, but I really wonder why the cluster decided to paper over
>>> the previous monitor operation timeouts. Maybe the operations
>>> finished meanwhile and got accounted somehow, just not logged?
>>
>> I'm not sure why the PE decided recovery was not necessary. Operation
>> results wouldn't be accepted without being logged.
>
> At which logging level? I can't see recurring monitor operation logs in
> syslog (at default logging level: notice) nor in /var/log/pacemaker.log
> (which contains info level messages as well).
>
> However, the info level logs contain more "Transition aborted" lines, as
> if only the first of them got logged with notice level. This would make
> sense, since the later ones don't make any difference on an already
> aborted transition, so they aren't that important. And in fact such
> lines were suppressed from the syslog I checked first, for example:
>
> 12:55:39 [8479] vhbl07 cib: info: cib_perform_op: Diff: --- 0.613.120 2
> 12:55:39 [8479] vhbl07 cib: info: cib_perform_op: Diff: +++ 0.613.121 (null)
> 12:55:39 [8479] vhbl07 cib: info: cib_perform_op: + /cib: @num_updates=121
> 12:55:39 [8479] vhbl07 cib: info: cib_perform_op: ++ /cib/status/node_state[@id='167773707']/lrm[@id='167773707']/lrm_resources/lrm_resource[@id='vm-elm']: <lrm_rsc_op id="vm-elm_monitor_60000" operation_key="vm-elm_monitor_60000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" transition-key="473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" transition-magic="0:0;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" on_node="vhbl05" call-id="645" rc-code="0" op-st
> 12:55:39 [8479] vhbl07 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=vhbl05/crmd/362, version=0.613.121)
> 12:55:39 [8484] vhbl07 crmd: info: abort_transition_graph: Transition aborted by vm-elm_monitor_60000 'create' on vhbl05: Foreign event (magic=0:0;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7, cib=0.613.121, source=process_graph_event:600, 0)
> 12:55:39 [8484] vhbl07 crmd: info: process_graph_event: Detected action (0.473) vm-elm_monitor_60000.645=ok: initiated by a different node
>
> I can very much imagine this cancelling the FAILED state induced by a
> monitor timeout like:
>
> 12:54:52 [8479] vhbl07 cib: info: cib_perform_op: ++ <lrm_resource id="vm-elm" type="TransientDomain" class="ocf" provider="niif">
> 12:54:52 [8479] vhbl07 cib: info: cib_perform_op: ++ <lrm_rsc_op id="vm-elm_last_failure_0" operation_key="vm-elm_monitor_60000" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10" transition-key="473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" transition-magic="2:1;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" on_node="vhbl05" call-id="645" rc-code="1" op-status="2" interval="60000" last-rc-change="1456833279" exe
> 12:54:52 [8479] vhbl07 cib: info: cib_perform_op: ++ <lrm_rsc_op id="vm-elm_last_0" operation_key="vm-elm_start_0" operation="start" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10" transition-key="472:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" transition-magic="0:0;472:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" on_node="vhbl05" call-id="602" rc-code="0" op-status="0" interval="0" last-run="1456091121" last-rc-change="1456091121" e
> 12:54:52 [8479] vhbl07 cib: info: cib_perform_op: ++ </lrm_resource>
>
> The transition-keys match, does this mean that the above is a late
> result from the monitor operation which was considered timed-out
> previously? How did it reach vhbl07, if the DC at that time was vhbl03?
>
>> The pe-input files from the transitions around here should help.
>
> They are available. What shall I look for?
It's not the most user-friendly of tools, but crm_simulate can show how
the cluster would react to each transition: crm_simulate -Sx $FILE.bz2
Adding -D $FILE.dot will output to a dot file, then dot $FILE.dot -Tpng
> $FILE.png will produce a graphic of the transition, which can be
interpreted the same way as
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#s-config-testing-changes
>>> Basically, the cluster responded beyond my expectations, sparing lots of
>>> unnecessary recoveries or fencing. I'm happy, thanks for this wonderful
>>> software! But I'm left with these "Processing failed op monitor"
>>> warnings emitted every 15 minutes (timer pops). Is it safe and clever
>>> to cleanup the affected resources? Would that get rid of them without
>>> invoking other effects, like recoveries for example?
>>
>> That's normal; it's how the cluster maintains the effect of a failure
>> that has not been cleared. The logs can be confusing, because it's not
>> apparent from that message alone whether the failure is new or old.
>
> Ah, do you mean that these are the same thing that appears after "Failed
> Actions:" at the end of the crm_mon output? They certainly match, and
> the logs are confusing indeed.
Exactly
>> Cleaning up the resource will end the failure condition, so what happens
>> next depends on the configuration and state of the cluster. If the
>> failure was preventing a preferred node from running the resource, the
>> resource could move, depending on other factors such as stickiness.
>
> These resources are (still) running fine, suffered only monitor failures
> and are node-neutral, so it should be safe to cleanup them, I suppose.
Most likely
More information about the Users
mailing list