[Pacemaker] [Problem]Number of times control of the fail-count is late.
renayama19661014 at ybb.ne.jp
renayama19661014 at ybb.ne.jp
Tue Nov 9 23:20:48 EST 2010
Hi,
We constituted a cluster by two node constitution.
The migration-threshold set it to 2.
We confirmed a phenomenon in the next procedure.
Step1) Start two nodes and send config5.crm. (The clnDiskd-resources is original.)
============
Last updated: Tue Nov 9 21:10:49 2010
Stack: Heartbeat
Current DC: srv02 (8c93dc22-a27e-409b-8112-4073de622daf) - partition with quorum
Version: 1.0.9-0a40fd0cb9f2fcedef9d1967115c912314c57438
2 Nodes configured, unknown expected votes
5 Resources configured.
============
Online: [ srv01 srv02 ]
vip (ocf::heartbeat:IPaddr2): Started srv01
Clone Set: clnDiskd
Started: [ srv01 srv02 ]
Clone Set: clnDummy2
Started: [ srv01 srv02 ]
Clone Set: clnPingd1
Started: [ srv01 srv02 ]
Node Attributes:
* Node srv01:
+ default_ping_set1 : 100
+ diskcheck_status_internal : normal
* Node srv02:
+ default_ping_set1 : 100
+ diskcheck_status_internal : normal
Migration summary:
* Node srv02:
* Node srv01:
Step2) We edit a clnDummy2 resource to raise time-out in start. (add sleep)
dummy_start() {
sleep 180 ----> add sleep
dummy_monitor
if [ $? = $OCF_SUCCESS ]; then
Step3) It causes a monitor error in a clnDummy2 resource.
# rm -rf /var/run/Dummy-Dummy2.state
Step4) clnDummy2 causes time-out by restart.
But, as for clnDummy2, a lot of starts are up after time-out once when they watch log.
In fact, the reason is because pengine does not know that fail-count became INFINITY.
Because the reason is because fail-count does not yet become INFINITY in pe-input-2001.bz2.
In pe-input-2002.bz2, fail-count becomes INFINITY.
(snip)
Nov 9 21:12:35 srv02 crmd: [5896]: WARN: status_from_rc: Action 25 (Dummy2:0_start_0) on srv01 failed
(target: 0 vs. rc: -2): Error
Nov 9 21:12:35 srv02 crmd: [5896]: WARN: update_failcount: Updating failcount for Dummy2:0 on srv01
after failed start: rc=-2 (update=INFINITY, time=1289304755)
Nov 9 21:12:35 srv02 crmd: [5896]: info: abort_transition_graph: match_graph_event:272 - Triggered
transition abort (complete=0, tag=lrm_rsc_op, id=Dummy2:0_start_0,
magic=2:-2;25:5:0:275da7f9-7f43-43a2-8308-41d0ab78346e, cib=0.9.39) : Event failed
Nov 9 21:12:35 srv02 crmd: [5896]: info: match_graph_event: Action Dummy2:0_start_0 (25) confirmed on
srv01 (rc=4)
Nov 9 21:12:35 srv02 crmd: [5896]: info: te_pseudo_action: Pseudo action 29 fired and confirmed
Nov 9 21:12:35 srv02 crmd: [5896]: info: run_graph:
====================================================
Nov 9 21:12:35 srv02 crmd: [5896]: notice: run_graph: Transition 5 (Complete=7, Pending=0, Fired=0,
Skipped=1, Incomplete=0, Source=/var/lib/pengine/pe-input-2000.bz2): Stopped
Nov 9 21:12:35 srv02 crmd: [5896]: info: te_graph_trigger: Transition 5 is now complete
Nov 9 21:12:35 srv02 crmd: [5896]: info: do_state_transition: State transition S_TRANSITION_ENGINE ->
S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Nov 9 21:12:35 srv02 crmd: [5896]: info: do_state_transition: All 2 cluster nodes are eligible to run
resources.
Nov 9 21:12:35 srv02 crmd: [5896]: info: do_pe_invoke: Query 72: Requesting the current CIB:
S_POLICY_ENGINE
Nov 9 21:12:35 srv02 crmd: [5896]: info: do_pe_invoke_callback: Invoking the PE: query=72,
ref=pe_calc-dc-1289304755-58, seq=2, quorate=1
Nov 9 21:12:35 srv02 pengine: [7208]: notice: unpack_config: On loss of CCM Quorum: Ignore
Nov 9 21:12:35 srv02 pengine: [7208]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' =
0, 'green' = 0
Nov 9 21:12:35 srv02 pengine: [7208]: WARN: unpack_nodes: Blind faith: not fencing unseen nodes
Nov 9 21:12:35 srv02 pengine: [7208]: info: determine_online_status: Node srv02 is online
Nov 9 21:12:35 srv02 pengine: [7208]: info: determine_online_status: Node srv01 is online
Nov 9 21:12:35 srv02 pengine: [7208]: WARN: unpack_rsc_op: Processing failed op
Dummy2:0_monitor_15000 on srv01: not running (7)
Nov 9 21:12:35 srv02 pengine: [7208]: WARN: unpack_rsc_op: Processing failed op Dummy2:0_start_0 on
srv01: unknown exec error (-2)
Nov 9 21:12:35 srv02 pengine: [7208]: notice: native_print: Dummy (ocf::pacemaker:Dummy): Started
srv01
Nov 9 21:12:35 srv02 pengine: [7208]: notice: native_print: vip (ocf::heartbeat:IPaddr2): Started
srv01
Nov 9 21:12:35 srv02 pengine: [7208]: notice: clone_print: Clone Set: clnDiskd
Nov 9 21:12:35 srv02 pengine: [7208]: notice: short_print: Started: [ srv01 srv02 ]
Nov 9 21:12:35 srv02 pengine: [7208]: notice: clone_print: Clone Set: clnDummy2
Nov 9 21:12:35 srv02 pengine: [7208]: notice: native_print: Dummy2:0 (ocf::pacemaker:Dummy2):
Started srv01 FAILED
Nov 9 21:12:35 srv02 pengine: [7208]: notice: short_print: Started: [ srv02 ]
Nov 9 21:12:35 srv02 pengine: [7208]: notice: clone_print: Clone Set: clnPingd1
Nov 9 21:12:35 srv02 pengine: [7208]: notice: short_print: Started: [ srv01 srv02 ]
Nov 9 21:12:35 srv02 pengine: [7208]: info: get_failcount: clnDummy2 has failed 1 times on srv01
Nov 9 21:12:35 srv02 pengine: [7208]: notice: common_apply_stickiness: clnDummy2 can fail 1 more
times on srv01 before being forced off
Nov 9 21:12:35 srv02 pengine: [7208]: info: get_failcount: clnDummy2 has failed 1 times on srv01
Nov 9 21:12:35 srv02 pengine: [7208]: notice: common_apply_stickiness: clnDummy2 can fail 1 more
times on srv01 before being forced off
Nov 9 21:12:35 srv02 pengine: [7208]: ERROR: unpack_operation: Specifying on_fail=fence and
stonith-enabled=false makes no sense
Nov 9 21:12:35 srv02 pengine: [7208]: notice: RecurringOp: Start recurring monitor (15s) for
Dummy2:0 on srv01
Nov 9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource Dummy (Started srv01)
Nov 9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource vip (Started srv01)
Nov 9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource prmDiskd:0 (Started srv01)
Nov 9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource prmDiskd:1 (Started srv02)
Nov 9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Recover resource Dummy2:0 (Started srv01)
Nov 9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource Dummy2:1 (Started srv02)
Nov 9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource prmPingd1:0 (Started srv01)
Nov 9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource prmPingd1:1 (Started srv02)
Nov 9 21:12:35 srv02 crmd: [5896]: info: do_state_transition: State transition S_POLICY_ENGINE ->
S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Nov 9 21:12:35 srv02 crmd: [5896]: info: unpack_graph: Unpacked transition 6: 8 actions in 8 synapses
Nov 9 21:12:35 srv02 crmd: [5896]: info: do_te_invoke: Processing graph 6
(ref=pe_calc-dc-1289304755-58) derived from /var/lib/pengine/pe-input-2001.bz2
Nov 9 21:12:35 srv02 crmd: [5896]: info: te_pseudo_action: Pseudo action 30 fired and confirmed
Nov 9 21:12:35 srv02 crmd: [5896]: info: te_rsc_command: Initiating action 5: stop Dummy2:0_stop_0 on
srv01
Nov 9 21:12:35 srv02 pengine: [7208]: info: process_pe_message: Transition 6: PEngine Input stored
in: /var/lib/pengine/pe-input-2001.bz2
Nov 9 21:12:35 srv02 pengine: [7208]: info: process_pe_message: Configuration ERRORs found during PE
processing. Please run "crm_verify -L" to identify issues.
Nov 9 21:12:37 srv02 attrd: [5895]: info: attrd_ha_callback: flush message from srv01
Nov 9 21:12:37 srv02 crmd: [5896]: info: match_graph_event: Action Dummy2:0_stop_0 (5) confirmed on
srv01 (rc=0)
Nov 9 21:12:37 srv02 crmd: [5896]: info: te_pseudo_action: Pseudo action 31 fired and confirmed
Nov 9 21:12:37 srv02 crmd: [5896]: info: te_pseudo_action: Pseudo action 8 fired and confirmed
Nov 9 21:12:37 srv02 crmd: [5896]: info: te_pseudo_action: Pseudo action 28 fired and confirmed
Nov 9 21:12:37 srv02 crmd: [5896]: info: te_rsc_command: Initiating action 24: start Dummy2:0_start_0
on srv01
-----> Must not carry out this start.
Nov 9 21:12:37 srv02 crmd: [5896]: info: abort_transition_graph: te_update_diff:146 - Triggered
transition abort (complete=0, tag=transient_attributes, id=519bb7a2-3c31-414a-b6b2-eaef36a09fb7,
magic=NA, cib=0.9.41) : Transient attribute: update
Nov 9 21:12:37 srv02 crmd: [5896]: info: update_abort_priority: Abort priority upgraded from 0 to
1000000
Nov 9 21:12:37 srv02 crmd: [5896]: info: update_abort_priority: Abort action done superceeded by
restart
Nov 9 21:12:37 srv02 crmd: [5896]: info: abort_transition_graph: te_update_diff:146 - Triggered
transition abort (complete=0, tag=transient_attributes, id=519bb7a2-3c31-414a-b6b2-eaef36a09fb7,
magic=NA, cib=0.9.42) : Transient attribute: update
(snip)
It seems to be a problem that update of fail-count was late.
But, this problem seems to occur by a timing.
It affects it in fail over time of the resource that the control number of times of fail-count is
wrong.
Is this problem already discussed?
Is not a delay of the update of fail-count which went by way of attrd a problem?
* I attach log and some pe-files at Bugzilla.
* http://developerbugs.linux-foundation.org/show_bug.cgi?id=2520
Best Regards,
Hideo Yamauchi.
More information about the Pacemaker
mailing list