[Pacemaker] [Problem] Fail-count update is late, breaking migration-threshold control

Andrew Beekhof andrew at beekhof.net
Thu Nov 11 10:48:59 UTC 2010


On Wed, Nov 10, 2010 at 5:20 AM,  <renayama19661014 at ybb.ne.jp> wrote:
> Hi,
>
> We built a two-node cluster.
> The migration-threshold was set to 2.
>
> We reproduced the problem with the following procedure.
>
> Step 1) Start both nodes and load config5.crm. (The clnDiskd resources are our own.)
>
> ============
> Last updated: Tue Nov  9 21:10:49 2010
> Stack: Heartbeat
> Current DC: srv02 (8c93dc22-a27e-409b-8112-4073de622daf) - partition with quorum
> Version: 1.0.9-0a40fd0cb9f2fcedef9d1967115c912314c57438
> 2 Nodes configured, unknown expected votes
> 5 Resources configured.
> ============
>
> Online: [ srv01 srv02 ]
>
>  vip    (ocf::heartbeat:IPaddr2):       Started srv01
>  Clone Set: clnDiskd
>     Started: [ srv01 srv02 ]
>  Clone Set: clnDummy2
>     Started: [ srv01 srv02 ]
>  Clone Set: clnPingd1
>     Started: [ srv01 srv02 ]
>
> Node Attributes:
> * Node srv01:
>    + default_ping_set1                 : 100
>    + diskcheck_status_internal         : normal
> * Node srv02:
>    + default_ping_set1                 : 100
>    + diskcheck_status_internal         : normal
>
> Migration summary:
> * Node srv02:
> * Node srv01:
>
>
> Step 2) Edit the clnDummy2 resource agent so that its start operation times out (add a sleep):
>
>  dummy_start() {
>    sleep 180    # added to force a start timeout
>    dummy_monitor
>    if [ $? =  $OCF_SUCCESS ]; then
>
>
> Step 3) Cause a monitor failure of the clnDummy2 resource:
>
>  # rm -rf /var/run/Dummy-Dummy2.state
>
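For readers following along: removing the state file is what makes the next recurring
monitor return "not running". The ocf:pacemaker:Dummy agent's monitor essentially just
checks that its state file exists; here is a minimal stand-alone sketch (the path and
function body are illustrative, not the real agent code):

```shell
#!/bin/sh
# Minimal model of the Dummy agent's monitor: "running" means the
# state file exists. Uses a temp path instead of /var/run.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7
STATE="${TMPDIR:-/tmp}/Dummy-model.state"

dummy_monitor() {
    if [ -f "$STATE" ]; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

touch "$STATE"
dummy_monitor && echo "monitor: running"
rm -f "$STATE"                 # same effect as Step 3's rm
dummy_monitor || echo "monitor: not running (rc=$?)"
```

After the `rm`, the next scheduled monitor reports rc=7 (OCF_NOT_RUNNING), which is what
triggers the recovery (stop + start) seen in the logs below.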
> Step 4) The restart of clnDummy2 then times out.
>
> However, the log shows that clnDummy2 is started again after the timeout.
> The reason is that the pengine does not yet know that the fail-count has become INFINITY:
> in pe-input-2001.bz2 the fail-count has not yet reached INFINITY,
> while in pe-input-2002.bz2 it has.
>
> (snip)
> Nov  9 21:12:35 srv02 crmd: [5896]: WARN: status_from_rc: Action 25 (Dummy2:0_start_0) on srv01 failed
> (target: 0 vs. rc: -2): Error
> Nov  9 21:12:35 srv02 crmd: [5896]: WARN: update_failcount: Updating failcount for Dummy2:0 on srv01
> after failed start: rc=-2 (update=INFINITY, time=1289304755)
> Nov  9 21:12:35 srv02 crmd: [5896]: info: abort_transition_graph: match_graph_event:272 - Triggered
> transition abort (complete=0, tag=lrm_rsc_op, id=Dummy2:0_start_0,
> magic=2:-2;25:5:0:275da7f9-7f43-43a2-8308-41d0ab78346e, cib=0.9.39) : Event failed
> Nov  9 21:12:35 srv02 crmd: [5896]: info: match_graph_event: Action Dummy2:0_start_0 (25) confirmed on
> srv01 (rc=4)
> Nov  9 21:12:35 srv02 crmd: [5896]: info: te_pseudo_action: Pseudo action 29 fired and confirmed
> Nov  9 21:12:35 srv02 crmd: [5896]: info: run_graph:
> ====================================================
> Nov  9 21:12:35 srv02 crmd: [5896]: notice: run_graph: Transition 5 (Complete=7, Pending=0, Fired=0,
> Skipped=1, Incomplete=0, Source=/var/lib/pengine/pe-input-2000.bz2): Stopped
> Nov  9 21:12:35 srv02 crmd: [5896]: info: te_graph_trigger: Transition 5 is now complete
> Nov  9 21:12:35 srv02 crmd: [5896]: info: do_state_transition: State transition S_TRANSITION_ENGINE ->
> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
> Nov  9 21:12:35 srv02 crmd: [5896]: info: do_state_transition: All 2 cluster nodes are eligible to run
> resources.
> Nov  9 21:12:35 srv02 crmd: [5896]: info: do_pe_invoke: Query 72: Requesting the current CIB:
> S_POLICY_ENGINE
> Nov  9 21:12:35 srv02 crmd: [5896]: info: do_pe_invoke_callback: Invoking the PE: query=72,
> ref=pe_calc-dc-1289304755-58, seq=2, quorate=1
> Nov  9 21:12:35 srv02 pengine: [7208]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Nov  9 21:12:35 srv02 pengine: [7208]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' =
> 0, 'green' = 0
> Nov  9 21:12:35 srv02 pengine: [7208]: WARN: unpack_nodes: Blind faith: not fencing unseen nodes
> Nov  9 21:12:35 srv02 pengine: [7208]: info: determine_online_status: Node srv02 is online
> Nov  9 21:12:35 srv02 pengine: [7208]: info: determine_online_status: Node srv01 is online
> Nov  9 21:12:35 srv02 pengine: [7208]: WARN: unpack_rsc_op: Processing failed op
> Dummy2:0_monitor_15000 on srv01: not running (7)
> Nov  9 21:12:35 srv02 pengine: [7208]: WARN: unpack_rsc_op: Processing failed op Dummy2:0_start_0 on
> srv01: unknown exec error (-2)
> Nov  9 21:12:35 srv02 pengine: [7208]: notice: native_print: Dummy      (ocf::pacemaker:Dummy): Started
> srv01
> Nov  9 21:12:35 srv02 pengine: [7208]: notice: native_print: vip        (ocf::heartbeat:IPaddr2):       Started
> srv01
> Nov  9 21:12:35 srv02 pengine: [7208]: notice: clone_print:  Clone Set: clnDiskd
> Nov  9 21:12:35 srv02 pengine: [7208]: notice: short_print:      Started: [ srv01 srv02 ]
> Nov  9 21:12:35 srv02 pengine: [7208]: notice: clone_print:  Clone Set: clnDummy2
> Nov  9 21:12:35 srv02 pengine: [7208]: notice: native_print:      Dummy2:0      (ocf::pacemaker:Dummy2):
> Started srv01 FAILED
> Nov  9 21:12:35 srv02 pengine: [7208]: notice: short_print:      Started: [ srv02 ]
> Nov  9 21:12:35 srv02 pengine: [7208]: notice: clone_print:  Clone Set: clnPingd1
> Nov  9 21:12:35 srv02 pengine: [7208]: notice: short_print:      Started: [ srv01 srv02 ]
> Nov  9 21:12:35 srv02 pengine: [7208]: info: get_failcount: clnDummy2 has failed 1 times on srv01
> Nov  9 21:12:35 srv02 pengine: [7208]: notice: common_apply_stickiness: clnDummy2 can fail 1 more
> times on srv01 before being forced off
> Nov  9 21:12:35 srv02 pengine: [7208]: info: get_failcount: clnDummy2 has failed 1 times on srv01
> Nov  9 21:12:35 srv02 pengine: [7208]: notice: common_apply_stickiness: clnDummy2 can fail 1 more
> times on srv01 before being forced off
> Nov  9 21:12:35 srv02 pengine: [7208]: ERROR: unpack_operation: Specifying on_fail=fence and
> stonith-enabled=false makes no sense
> Nov  9 21:12:35 srv02 pengine: [7208]: notice: RecurringOp:  Start recurring monitor (15s) for
> Dummy2:0 on srv01
> Nov  9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource Dummy (Started srv01)
> Nov  9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource vip   (Started srv01)
> Nov  9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource prmDiskd:0    (Started srv01)
> Nov  9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource prmDiskd:1    (Started srv02)
> Nov  9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Recover resource Dummy2:0    (Started srv01)
> Nov  9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource Dummy2:1      (Started srv02)
> Nov  9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource prmPingd1:0   (Started srv01)
> Nov  9 21:12:35 srv02 pengine: [7208]: notice: LogActions: Leave resource prmPingd1:1   (Started srv02)
> Nov  9 21:12:35 srv02 crmd: [5896]: info: do_state_transition: State transition S_POLICY_ENGINE ->
> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Nov  9 21:12:35 srv02 crmd: [5896]: info: unpack_graph: Unpacked transition 6: 8 actions in 8 synapses
> Nov  9 21:12:35 srv02 crmd: [5896]: info: do_te_invoke: Processing graph 6
> (ref=pe_calc-dc-1289304755-58) derived from /var/lib/pengine/pe-input-2001.bz2
> Nov  9 21:12:35 srv02 crmd: [5896]: info: te_pseudo_action: Pseudo action 30 fired and confirmed
> Nov  9 21:12:35 srv02 crmd: [5896]: info: te_rsc_command: Initiating action 5: stop Dummy2:0_stop_0 on
> srv01
> Nov  9 21:12:35 srv02 pengine: [7208]: info: process_pe_message: Transition 6: PEngine Input stored
> in: /var/lib/pengine/pe-input-2001.bz2
> Nov  9 21:12:35 srv02 pengine: [7208]: info: process_pe_message: Configuration ERRORs found during PE
> processing.  Please run "crm_verify -L" to identify issues.
> Nov  9 21:12:37 srv02 attrd: [5895]: info: attrd_ha_callback: flush message from srv01
> Nov  9 21:12:37 srv02 crmd: [5896]: info: match_graph_event: Action Dummy2:0_stop_0 (5) confirmed on
> srv01 (rc=0)
> Nov  9 21:12:37 srv02 crmd: [5896]: info: te_pseudo_action: Pseudo action 31 fired and confirmed
> Nov  9 21:12:37 srv02 crmd: [5896]: info: te_pseudo_action: Pseudo action 8 fired and confirmed
> Nov  9 21:12:37 srv02 crmd: [5896]: info: te_pseudo_action: Pseudo action 28 fired and confirmed
> Nov  9 21:12:37 srv02 crmd: [5896]: info: te_rsc_command: Initiating action 24: start Dummy2:0_start_0
> on srv01
>
>  -----> This start must not be carried out.
>
> Nov  9 21:12:37 srv02 crmd: [5896]: info: abort_transition_graph: te_update_diff:146 - Triggered
> transition abort (complete=0, tag=transient_attributes, id=519bb7a2-3c31-414a-b6b2-eaef36a09fb7,
> magic=NA, cib=0.9.41) : Transient attribute: update
> Nov  9 21:12:37 srv02 crmd: [5896]: info: update_abort_priority: Abort priority upgraded from 0 to
> 1000000
> Nov  9 21:12:37 srv02 crmd: [5896]: info: update_abort_priority: Abort action done superceeded by
> restart
> Nov  9 21:12:37 srv02 crmd: [5896]: info: abort_transition_graph: te_update_diff:146 - Triggered
> transition abort (complete=0, tag=transient_attributes, id=519bb7a2-3c31-414a-b6b2-eaef36a09fb7,
> magic=NA, cib=0.9.42) : Transient attribute: update
> (snip)
>
> The problem seems to be that the update of the fail-count was late, and it appears to
> depend on timing.
>
> When the migration-threshold control goes wrong like this, it delays the failover of
> the resource.
>
> Is this problem already discussed?

Not that I know of.

> Isn't the delay in the fail-count update, which travels via attrd, the problem?

Indeed.
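The sequence is roughly: the crmd asks attrd to update fail-count-Dummy2, attrd writes it
to the CIB status section asynchronously, and the PE can be invoked against a snapshot
taken before that write lands. A toy model of the race (the dictionaries and function
here are illustrative, not Pacemaker internals):

```python
# Toy model of the fail-count race: the PE reads a CIB snapshot that
# may not yet contain attrd's pending fail-count update.
MIGRATION_THRESHOLD = 2
INFINITY = 1_000_000

def pe_decision(cib_failcount):
    """What the policy engine decides for the failed clone instance."""
    if cib_failcount >= MIGRATION_THRESHOLD:
        return "ban"      # force the resource off the node
    return "recover"      # stop + start on the same node

cib = {"fail-count-Dummy2": 1}             # monitor failure already recorded
pending = {"fail-count-Dummy2": INFINITY}  # start failure sent to attrd, not yet written

# Transition 6 is computed from the stale snapshot: wrong answer.
assert pe_decision(cib["fail-count-Dummy2"]) == "recover"

# After attrd flushes the update, a later transition gets it right.
cib.update(pending)
assert pe_decision(cib["fail-count-Dummy2"]) == "ban"
```

That is exactly the window visible in the logs: pe-input-2001.bz2 is computed before the
INFINITY update lands, so the extra start is scheduled.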

>
>  * I attach log and some pe-files at Bugzilla.
>  * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2520

Ok, I'll follow up there.

>
> Best Regards,
> Hideo Yamauchi.
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>

