[Pacemaker] DC marks remote operation as timed out, leading to stonith

Vladislav Bogdanov bubble at hoster-ok.com
Tue Aug 9 04:40:31 EDT 2011


Hi Andrew, all,

Just found some slightly strange crmd behaviour: the DC timed out a stop
operation (which led to node fencing) by itself, without even waiting
for the lrmd (on another node) to finish that operation.

Here is what it did:
===============
Aug  9 07:30:07 mgmt01 crmd: [15781]: WARN: action_timer_callback: Timer
popped (timeout=600000, abort_level=1000000, complete=false)
Aug  9 07:30:07 mgmt01 crmd: [15781]: ERROR: print_elem: Aborting
transition, action lost: [Action 228]: In-flight (id:
vptest22.vds-ok.com-vm_stop_0, loc: v03-a, priority: 0)
Aug  9 07:30:07 mgmt01 crmd: [15781]: info: abort_transition_graph:
action_timer_callback:486 - Triggered transition abort (complete=0) :
Action lost
Aug  9 07:30:07 mgmt01 crmd: [15781]: WARN: cib_action_update: rsc_op
228: vptest22.vds-ok.com-vm_stop_0 on v03-a timed out
Aug  9 07:30:07 mgmt01 crmd: [15781]: info: create_operation_update:
cib_action_update: Updating resouce vptest22.vds-ok.com-vm after Timed
Out stop op (interval=0)
Aug  9 07:30:07 mgmt01 crmd: [15781]: notice: run_graph: Transition 1500
(Complete=64, Pending=0, Fired=0, Skipped=40, Incomplete=0,
Source=/var/lib/pengine/pe-input-19.bz2): Stopped
Aug  9 07:30:07 mgmt01 crmd: [15781]: info: te_graph_trigger: Transition
1500 is now complete
Aug  9 07:30:07 mgmt01 crmd: [15781]: info: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC
cause=C_FSA_INTERNAL origin=notify_crmd ]
Aug  9 07:30:07 mgmt01 crmd: [15781]: info: do_state_transition: All 3
cluster nodes are eligible to run resources.
Aug  9 07:30:07 mgmt01 crmd: [15781]: info: process_graph_event: Action
vptest22.vds-ok.com-vm_stop_0 arrived after a completed transition
Aug  9 07:30:07 mgmt01 crmd: [15781]: info: abort_transition_graph:
process_graph_event:482 - Triggered transition abort (complete=1,
tag=lrm_rsc_op, id=vptest22.vds-ok.com-vm_last_failure_0,
magic=2:1;228:1500:0:c9c55d75-3a1e-4fcc-af7f-81ede42fa918,
cib=0.4519.88) : Inactive graph
Aug  9 07:30:07 mgmt01 crmd: [15781]: WARN: update_failcount: Updating
failcount for vptest22.vds-ok.com-vm on v03-a after failed stop: rc=1
(update=INFINITY, time=1312875007)
Aug  9 07:30:08 mgmt01 crmd: [15781]: info: abort_transition_graph:
te_update_diff:142 - Triggered transition abort (complete=1, tag=nvpair,
id=status-v03-a-fail-count-vptest22.vds-ok.com-vm,
name=fail-count-vptest22.vds-ok.com-vm, value=INFINITY, magic=NA,
cib=0.4519.91) : Transient attribute: update
Aug  9 07:30:10 mgmt01 pengine: [15780]: WARN: pe_fence_node: Node v03-a
will be fenced to recover from resource failure(s)
Aug  9 07:30:10 mgmt01 pengine: [15780]: WARN: unpack_rsc_op: Processing
failed op vptest22.vds-ok.com-vm_last_failure_0 on v03-a: unknown error (1)
===============

That was a mass stop of 60 virtual machines on two 12-core nodes. I
suspect the nodes were under very high load at the time, so it is
possible that crmd and lrmd on the node being fenced could not get
enough CPU cycles from the scheduler. I did see messages from lrmd
there saying that some operations had stayed in the op list for a very
long time (~10 min), but the last such message was logged one minute
before the DC decided to time out that stop operation. During that
minute only a few informational messages from some resource agents were
logged, nothing from lrmd or crmd.
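
If it helps to confirm (or rule out) CPU starvation, I can pull the
load history and lrmd's scheduling class from that node, something like
this (assuming sysstat's sar was collecting data at the time):

  # run-queue length and load averages around the incident
  # (sa09 = sysstat data file for Aug 9):
  sar -q -f /var/log/sa/sa09
  # current scheduling policy and priority of lrmd:
  chrt -p $(pidof lrmd)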

What could cause such behaviour? High CPU load?

Is it intended that the DC may do that (mark an operation as timed out
by itself)?

What can be done to prevent such things in the future? A higher
scheduling priority (up to real-time) for lrmd (or for the whole
Pacemaker subsystem)? Longer timeouts?
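
To make those last two questions concrete, this is roughly what I have
in mind (the SCHED_RR priority of 10 and the 1200s timeout are only
illustrative values, not tested recommendations):

  # give the running lrmd a real-time scheduling class:
  chrt -r -p 10 $(pidof lrmd)
  # raise the default operation timeout cluster-wide (crm shell):
  crm configure op_defaults timeout=1200s
  # ...or give just one VM a longer stop timeout, e.g.
  #   op stop interval=0 timeout=1200s
  crm configure edit vptest22.vds-ok.com-vm

Of course, a real-time lrmd would only help if the stall really is
scheduler-induced rather than I/O-bound.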

Should I post an hb_report for this (if it is not intended behaviour,
of course)?

Thank you,
Vladislav



