[ClusterLabs] clearing failed actions
Attila Megyeri
amegyeri at minerva-soft.com
Mon Jun 19 17:54:28 EDT 2017
One more thing to add.
Two almost identical clusters with an identical asterisk primitive produce different crm_verify output: one cluster returns no warnings, whereas the other one complains.
On the problematic one:
crm_verify --live-check -VV
warning: get_failcount_full: Setting asterisk.failure_timeout=120 in asterisk-stop-0 conflicts with on-fail=block: ignoring timeout
Warnings found during check: config may not be valid
The relevant primitive is in both clusters:
primitive asterisk ocf:heartbeat:asterisk \
op monitor interval="10s" timeout="45s" on-fail="restart" \
op start interval="0" timeout="60s" on-fail="standby" \
op stop interval="0" timeout="60s" on-fail="block" \
meta migration-threshold="3" failure-timeout="2m"
Why is the same configuration valid in one, but not in the other cluster?
Shall I simply omit the "op stop" line?
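
If the difference really comes down to this warning, one option (a sketch only, not verified against either cluster) is to resolve the conflict the warning names: either drop failure-timeout, or stop setting on-fail=block explicitly on the stop operation, e.g.:

```
primitive asterisk ocf:heartbeat:asterisk \
    op monitor interval="10s" timeout="45s" on-fail="restart" \
    op start interval="0" timeout="60s" on-fail="standby" \
    op stop interval="0" timeout="60s" \
    meta migration-threshold="3" failure-timeout="2m"
```

Note that with stonith-enabled=false a failed stop defaults to block anyway, so the warning may simply reflect the two clusters parsing the same configuration with different Pacemaker versions.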
thanks :)
Attila
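
For reference, the failure-timeout expiry rule Ken describes later in this thread — all failures of a resource must be older than the timeout before any are cleared — can be sketched in Python (an illustrative model, not Pacemaker code):

```python
import time

def failcount_expired(failure_times, failure_timeout, now=None):
    """Model of Pacemaker's failure-timeout rule: the fail count of a
    resource is only cleared once ALL of its recorded failures are older
    than failure-timeout; failures are never expired one by one."""
    now = time.time() if now is None else now
    return bool(failure_times) and all(
        now - t > failure_timeout for t in failure_times
    )

# Failures at t=0 and t=50, timeout 120s, checked at t=200: both old enough.
assert failcount_expired([0, 50], failure_timeout=120, now=200)

# A repeating failure (latest at t=170) keeps the whole count alive.
assert not failcount_expired([0, 170], failure_timeout=120, now=200)
```

This is why a resource that keeps failing more often than its failure-timeout never has its count cleared automatically.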
> -----Original Message-----
> From: Attila Megyeri [mailto:amegyeri at minerva-soft.com]
> Sent: Monday, June 19, 2017 9:47 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> <users at clusterlabs.org>; kgaillot at redhat.com
> Subject: Re: [ClusterLabs] clearing failed actions
>
> I did another experiment, even simpler.
>
> Created one node with one resource, using Pacemaker 1.1.14 on Ubuntu.
>
> Configured failcount to 1, migration threshold to 2, failure timeout to 1
> minute.
>
> crm_mon:
>
> Last updated: Mon Jun 19 19:43:41 2017
> Last change: Mon Jun 19 19:37:09 2017 by root via cibadmin on test
> Stack: corosync
> Current DC: test (version 1.1.14-70404b0) - partition with quorum
> 1 node and 1 resource configured
>
> Online: [ test ]
>
> db-ip-master (ocf::heartbeat:IPaddr2): Started test
>
> Node Attributes:
> * Node test:
>
> Migration Summary:
> * Node test:
> db-ip-master: migration-threshold=2 fail-count=1
>
> crm verify:
>
> crm_verify --live-check -VVVV
> info: validate_with_relaxng: Creating RNG parser context
> info: determine_online_status: Node test is online
> info: get_failcount_full: db-ip-master has failed 1 times on test
> info: get_failcount_full: db-ip-master has failed 1 times on test
> info: get_failcount_full: db-ip-master has failed 1 times on test
> info: get_failcount_full: db-ip-master has failed 1 times on test
> info: native_print: db-ip-master (ocf::heartbeat:IPaddr2): Started test
> info: get_failcount_full: db-ip-master has failed 1 times on test
> info: common_apply_stickiness: db-ip-master can fail 1 more times on
> test before being forced off
> info: LogActions: Leave db-ip-master (Started test)
>
>
> crm configure is:
>
> node 168362242: test \
> attributes standby=off
> primitive db-ip-master IPaddr2 \
> params lvs_support=true ip=10.9.1.10 cidr_netmask=24
> broadcast=10.9.1.255 \
> op start interval=0 timeout=20s on-fail=restart \
> op monitor interval=20s timeout=20s \
> op stop interval=0 timeout=20s on-fail=block \
> meta migration-threshold=2 failure-timeout=1m target-role=Started
> location loc1 db-ip-master 0: test
> property cib-bootstrap-options: \
> have-watchdog=false \
> dc-version=1.1.14-70404b0 \
> cluster-infrastructure=corosync \
> stonith-enabled=false \
> cluster-recheck-interval=30s \
> symmetric-cluster=false
>
>
>
>
> Corosync log:
>
>
> Jun 19 19:45:07 [331] test crmd: notice: do_state_transition: State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Jun 19 19:45:07 [330] test pengine: info: process_pe_message: Input has
> not changed since last time, not saving to disk
> Jun 19 19:45:07 [330] test pengine: info: determine_online_status:
> Node test is online
> Jun 19 19:45:07 [330] test pengine: info: get_failcount_full: db-ip-master
> has failed 1 times on test
> Jun 19 19:45:07 [330] test pengine: info: get_failcount_full: db-ip-master
> has failed 1 times on test
> Jun 19 19:45:07 [330] test pengine: info: get_failcount_full: db-ip-master
> has failed 1 times on test
> Jun 19 19:45:07 [330] test pengine: info: get_failcount_full: db-ip-master
> has failed 1 times on test
> Jun 19 19:45:07 [330] test pengine: info: native_print: db-ip-master
> (ocf::heartbeat:IPaddr2): Started test
> Jun 19 19:45:07 [330] test pengine: info: get_failcount_full: db-ip-master
> has failed 1 times on test
> Jun 19 19:45:07 [330] test pengine: info: common_apply_stickiness:
> db-ip-master can fail 1 more times on test before being forced off
> Jun 19 19:45:07 [330] test pengine: info: LogActions: Leave db-ip-
> master (Started test)
> Jun 19 19:45:07 [330] test pengine: notice: process_pe_message:
> Calculated Transition 34: /var/lib/pacemaker/pengine/pe-input-6.bz2
> Jun 19 19:45:07 [331] test crmd: info: do_state_transition: State
> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [
> input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Jun 19 19:45:07 [331] test crmd: notice: run_graph: Transition 34
> (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-6.bz2): Complete
> Jun 19 19:45:07 [331] test crmd: info: do_log: FSA: Input
> I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
> Jun 19 19:45:07 [331] test crmd: notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
>
>
> I hope someone can help me figure this out :)
>
> Thanks!
>
>
>
> > -----Original Message-----
> > From: Attila Megyeri [mailto:amegyeri at minerva-soft.com]
> > Sent: Monday, June 19, 2017 7:45 PM
> > To: kgaillot at redhat.com; Cluster Labs - All topics related to open-source
> > clustering welcomed <users at clusterlabs.org>
> > Subject: Re: [ClusterLabs] clearing failed actions
> >
> > Hi Ken,
> >
> > /sorry for the long text/
> >
> > I have created a relatively simple setup to localize the issue.
> > Three nodes, no fencing, just a master/slave mysql with two virtual IPs.
> > As a reminder, my primary issue is that on cluster recheck intervals the
> > failcounts are not cleared.
> >
> > I simulated a failure with:
> >
> > crm_failcount -N ctdb1 -r db-ip-master -v 1
> >
> >
> > crm_mon shows:
> >
> > Last updated: Mon Jun 19 17:34:35 2017
> > Last change: Mon Jun 19 17:34:35 2017 via cibadmin on ctmgr
> > Stack: corosync
> > Current DC: ctmgr (168362243) - partition with quorum
> > Version: 1.1.10-42f2063
> > 3 Nodes configured
> > 4 Resources configured
> >
> >
> > Online: [ ctdb1 ctdb2 ctmgr ]
> >
> > db-ip-master (ocf::heartbeat:IPaddr2): Started ctdb1
> > db-ip-slave (ocf::heartbeat:IPaddr2): Started ctdb2
> > Master/Slave Set: mysql [db-mysql]
> > Masters: [ ctdb1 ]
> > Slaves: [ ctdb2 ]
> >
> > Node Attributes:
> > * Node ctdb1:
> > + master-db-mysql : 3601
> > + readable : 1
> > * Node ctdb2:
> > + master-db-mysql : 3600
> > + readable : 1
> > * Node ctmgr:
> >
> > Migration summary:
> > * Node ctmgr:
> > * Node ctdb1:
> > db-ip-master: migration-threshold=1000000 fail-count=1
> > * Node ctdb2:
> >
> >
> >
> > When I check the pacemaker log on the DC, I see the following:
> >
> > Jun 19 17:37:06 [18998] ctmgr crmd: info: crm_timer_popped: PEngine
> > Recheck Timer (I_PE_CALC) just popped (30000ms)
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: s_crmd_fsa: Processing
> > I_PE_CALC: [ state=S_IDLE cause=C_TIMER_POPPED origin=crm_timer_popped ]
> > Jun 19 17:37:06 [18998] ctmgr crmd: notice: do_state_transition: State
> > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> > cause=C_TIMER_POPPED origin=crm_timer_popped ]
> > Jun 19 17:37:06 [18998] ctmgr crmd: info: do_state_transition:
> > Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: do_state_transition: All 3
> > cluster nodes are eligible to run resources.
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: do_pe_invoke: Query
> > 231: Requesting the current CIB: S_POLICY_ENGINE
> > Jun 19 17:37:06 [18994] ctmgr cib: info: cib_process_request:
> > Completed cib_query operation for section 'all': OK (rc=0,
> > origin=local/crmd/231, version=0.12.9)
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: do_pe_invoke_callback:
> > Invoking the PE: query=231, ref=pe_calc-dc-1497893826-144, seq=21884,
> > quorate=1
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: process_pe_message:
> > Input has not changed since last time, not saving to disk
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: STONITH
> > timeout: 60000
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: STONITH
> > of failed nodes is disabled
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: Stop all
> > active resources: false
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: Default
> > stickiness: 0
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: On loss
> > of CCM Quorum: Stop ALL resources
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: Node
> > scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_domains:
> > Unpacking domains
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: determine_online_status:
> > Node ctmgr is online
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: determine_online_status:
> > Node ctdb1 is online
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: determine_online_status:
> > Node ctdb2 is online
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: find_anonymous_clone:
> > Internally renamed db-mysql on ctmgr to db-mysql:0
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: find_anonymous_clone:
> > Internally renamed db-mysql on ctdb1 to db-mysql:0
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_rsc_op: db-
> > mysql_last_failure_0 on ctdb1 returned 8 (master) instead of the expected
> > value: 7 (not running)
> > Jun 19 17:37:06 [18997] ctmgr pengine: notice: unpack_rsc_op:
> > Operation monitor found resource db-mysql:0 active in master mode on
> > ctdb1
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: find_anonymous_clone:
> > Internally renamed db-mysql on ctdb2 to db-mysql:1
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_rsc_op:
> > db-mysql_last_failure_0 on ctdb2 returned 0 (ok) instead of the
> > expected value: 7 (not running)
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: unpack_rsc_op: Operation
> > monitor found resource db-mysql:1 active on ctdb2
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: native_print: db-ip-master
> > (ocf::heartbeat:IPaddr2): Started ctdb1
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: native_print: db-ip-slave
> > (ocf::heartbeat:IPaddr2): Started ctdb2
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: clone_print:
> > Master/Slave Set: mysql [db-mysql]
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_active: Resource
> > db-mysql:0 active on ctdb1
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_active: Resource
> > db-mysql:0 active on ctdb1
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_active: Resource
> > db-mysql:1 active on ctdb2
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_active: Resource
> > db-mysql:1 active on ctdb2
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: short_print: Masters: [ ctdb1 ]
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: short_print: Slaves: [ ctdb2 ]
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: get_failcount_full:
> > db-ip-master has failed 1 times on ctdb1
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: common_apply_stickiness:
> > db-ip-master can fail 999999 more times on ctdb1 before being forced off
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: common_apply_stickiness:
> > Resource db-mysql:0: preferring current location (node=ctdb1, weight=1)
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: common_apply_stickiness:
> > Resource db-mysql:1: preferring current location (node=ctdb2, weight=1)
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_assign_node:
> > Assigning ctdb1 to db-mysql:0
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_assign_node:
> > Assigning ctdb2 to db-mysql:1
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: clone_color: Allocated 2
> > mysql instances of a possible 2
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: master_color:
> > db-mysql:0 master score: 3601
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: master_color: Promoting
> > db-mysql:0 (Master ctdb1)
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: master_color:
> > db-mysql:1 master score: 3600
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: master_color: mysql:
> > Promoted 1 instances of a possible 1 to master
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_assign_node:
> > Assigning ctdb1 to db-ip-master
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_assign_node:
> > Assigning ctdb2 to db-ip-slave
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: master_create_actions:
> > Creating actions for mysql
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: LogActions: Leave
> > db-ip-master (Started ctdb1)
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: LogActions: Leave
> > db-ip-slave (Started ctdb2)
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: LogActions: Leave
> > db-mysql:0 (Master ctdb1)
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: LogActions: Leave
> > db-mysql:1 (Slave ctdb2)
> > Jun 19 17:37:06 [18997] ctmgr pengine: notice: process_pe_message:
> > Calculated Transition 38: /var/lib/pacemaker/pengine/pe-input-16.bz2
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: s_crmd_fsa: Processing
> > I_PE_SUCCESS: [ state=S_POLICY_ENGINE cause=C_IPC_MESSAGE
> > origin=handle_response ]
> > Jun 19 17:37:06 [18998] ctmgr crmd: info: do_state_transition: State
> > transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [
> > input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: unpack_graph: Unpacked
> > transition 38: 0 actions in 0 synapses
> > Jun 19 17:37:06 [18998] ctmgr crmd: info: do_te_invoke: Processing
> > graph 38 (ref=pe_calc-dc-1497893826-144) derived from
> > /var/lib/pacemaker/pengine/pe-input-16.bz2
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: print_graph: Empty
> > transition graph
> > Jun 19 17:37:06 [18998] ctmgr crmd: notice: run_graph: Transition 38
> > (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> > Source=/var/lib/pacemaker/pengine/pe-input-16.bz2): Complete
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: print_graph: Empty
> > transition graph
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: te_graph_trigger: Transition
> > 38 is now complete
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: notify_crmd: Processing
> > transition completion in state S_TRANSITION_ENGINE
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: notify_crmd: Transition
> > 38 status: done - <null>
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: s_crmd_fsa: Processing
> > I_TE_SUCCESS: [ state=S_TRANSITION_ENGINE cause=C_FSA_INTERNAL
> > origin=notify_crmd ]
> > Jun 19 17:37:06 [18998] ctmgr crmd: info: do_log: FSA: Input
> > I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
> > Jun 19 17:37:06 [18998] ctmgr crmd: notice: do_state_transition: State
> > transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> > cause=C_FSA_INTERNAL origin=notify_crmd ]
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: do_state_transition:
> > Starting PEngine Recheck Timer
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: crm_timer_start: Started
> > PEngine Recheck Timer (I_PE_CALC:30000ms), src=277
> >
> >
> >
> > As you can see from the logs, Pacemaker does not even try to re-monitor
> > the resource that had a failure, or at least I'm not seeing it.
> > Cluster recheck interval is set to 30 seconds for troubleshooting reasons.
> >
> > If I execute a
> >
> > crm resource cleanup db-ip-master
> >
> > The failure is removed.
> >
> > Am I getting something terribly wrong here?
> > Or is this simply a bug in 1.1.10?
> >
> >
> > Thanks,
> > Attila
> >
> >
> >
> >
> > > -----Original Message-----
> > > From: Ken Gaillot [mailto:kgaillot at redhat.com]
> > > Sent: Wednesday, June 7, 2017 10:14 PM
> > > To: Attila Megyeri <amegyeri at minerva-soft.com>; Cluster Labs - All topics
> > > related to open-source clustering welcomed <users at clusterlabs.org>
> > > Subject: Re: [ClusterLabs] clearing failed actions
> > >
> > > On 06/01/2017 02:44 PM, Attila Megyeri wrote:
> > > > Ken,
> > > >
> > > > I noticed something strange, this might be the issue.
> > > >
> > > > In some cases, even the manual cleanup does not work.
> > > >
> > > > I have a failed action of resource "A" on node "a". DC is node "b".
> > > >
> > > > e.g.
> > > > Failed actions:
> > > > jboss_imssrv1_monitor_10000 (node=ctims1, call=108, rc=1,
> > > status=complete, last-rc-change=Thu Jun 1 14:13:36 2017
> > > >
> > > >
> > > > When I attempt to do a "crm resource cleanup A" from node "b", nothing
> > > > happens. Basically the lrmd on "a" is not notified that it should monitor
> > > > the resource.
> > > >
> > > >
> > > > When I execute a "crm resource cleanup A" command on node "a" (where
> > > > the operation failed), the failed action is cleared properly.
> > > >
> > > > Why could this be happening?
> > > > Which component should be responsible for this? pengine, crmd, lrmd?
> > >
> > > The crm shell will send commands to attrd (to clear fail counts) and
> > > crmd (to clear the resource history), which in turn will record changes
> > > in the cib.
> > >
> > > I'm not sure how crm shell implements it, but crm_resource sends
> > > individual messages to each node when cleaning up a resource without
> > > specifying a particular node. You could check the pacemaker log on each
> > > node to see whether attrd and crmd are receiving those commands, and
> > > what they do in response.
> > >
> > >
> > > >> -----Original Message-----
> > > >> From: Attila Megyeri [mailto:amegyeri at minerva-soft.com]
> > > >> Sent: Thursday, June 1, 2017 6:57 PM
> > > >> To: kgaillot at redhat.com; Cluster Labs - All topics related to
> > > >> open-source clustering welcomed <users at clusterlabs.org>
> > > >> Subject: Re: [ClusterLabs] clearing failed actions
> > > >>
> > > >> thanks Ken,
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>> -----Original Message-----
> > > >>> From: Ken Gaillot [mailto:kgaillot at redhat.com]
> > > >>> Sent: Thursday, June 1, 2017 12:04 AM
> > > >>> To: users at clusterlabs.org
> > > >>> Subject: Re: [ClusterLabs] clearing failed actions
> > > >>>
> > > >>> On 05/31/2017 12:17 PM, Ken Gaillot wrote:
> > > >>>> On 05/30/2017 02:50 PM, Attila Megyeri wrote:
> > > >>>>> Hi Ken,
> > > >>>>>
> > > >>>>>
> > > >>>>>> -----Original Message-----
> > > >>>>>> From: Ken Gaillot [mailto:kgaillot at redhat.com]
> > > >>>>>> Sent: Tuesday, May 30, 2017 4:32 PM
> > > >>>>>> To: users at clusterlabs.org
> > > >>>>>> Subject: Re: [ClusterLabs] clearing failed actions
> > > >>>>>>
> > > >>>>>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
> > > >>>>>>> Hi,
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> Shouldn't the
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> cluster-recheck-interval="2m"
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> property instruct pacemaker to recheck the cluster every 2
> > > >>>>>>> minutes and clean the failcounts?
> > > >>>>>>
> > > >>>>>> It instructs pacemaker to recalculate whether any actions need
> > > >>>>>> to be taken (including expiring any failcounts appropriately).
> > > >>>>>>
> > > >>>>>>> At the primitive level I also have a
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> migration-threshold="30" failure-timeout="2m"
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> but whenever I have a failure, it remains there forever.
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> What could be causing this?
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> thanks,
> > > >>>>>>>
> > > >>>>>>> Attila
> > > >>>>>> Is it a single old failure, or a recurring failure? The failure
> > > >>>>>> timeout works in a somewhat nonintuitive way. Old failures are not
> > > >>>>>> individually expired. Instead, all failures of a resource are
> > > >>>>>> simultaneously cleared if all of them are older than the
> > > >>>>>> failure-timeout. So if something keeps failing repeatedly (more
> > > >>>>>> frequently than the failure-timeout), none of the failures will
> > > >>>>>> be cleared.
> > > >>>>>>
> > > >>>>>> If it's not a repeating failure, something odd is going on.
> > > >>>>>
> > > >>>>> It is not a repeating failure. Let's say that a resource fails for
> > > >>>>> whatever action: it will remain in the failed actions (crm_mon -Af)
> > > >>>>> until I issue a "crm resource cleanup <resource name>", even after
> > > >>>>> days or weeks, even though I see in the logs that the cluster is
> > > >>>>> rechecked every 120 seconds.
> > > >>>>>
> > > >>>>> How could I troubleshoot this issue?
> > > >>>>>
> > > >>>>> thanks!
> > > >>>>
> > > >>>>
> > > >>>> Ah, I see what you're saying. That's expected behavior.
> > > >>>>
> > > >>>> The failure-timeout applies to the failure *count* (which is used for
> > > >>>> checking against migration-threshold), not the failure *history*
> > > >>>> (which is used for the status display).
> > > >>>>
> > > >>>> The idea is to have it no longer affect the cluster behavior, but
> > > >>>> still allow an administrator to know that it happened. That's why a
> > > >>>> manual cleanup is required to clear the history.
> > > >>>
> > > >>> Hmm, I'm wrong there ... failure-timeout does expire the failure
> > > >>> history used for status display.
> > > >>>
> > > >>> It works with the current versions. It's possible 1.1.10 had issues with
> > > >>> that.
> > > >>>
> > > >>
> > > >> Well, if nothing helps I will try to upgrade to a more recent version.
> > > >>
> > > >>
> > > >>
> > > >>> Check the status to see which node is DC, and look at the pacemaker
> > > >>> log there after the failure occurred. There should be a message about
> > > >>> the failcount expiring. You can also look at the live CIB and search
> > > >>> for last_failure to see what is used for the display.
> > > >> [AM]
> > > >>
> > > >> In the pacemaker log I see at every recheck interval the following lines:
> > > >>
> > > >> Jun 01 16:54:08 [8700] ctabsws2 pengine: warning: unpack_rsc_op:
> > > >> Processing failed op start for jboss_admin2 on ctadmin2: unknown error (1)
> > > >>
> > > >> If I check the CIB for the failure I see:
> > > >>
> > > >> <nvpair id="status-168362322-last-failure-jboss_admin2"
> > > >> name="last-failure-jboss_admin2" value="1496326649"/>
> > > >> <lrm_rsc_op id="jboss_admin2_last_failure_0"
> > > >> operation_key="jboss_admin2_start_0" operation="start"
> > > >> crm-debug-origin="do_update_resource" crm_feature_set="3.0.7"
> > > >> transition-key="73:4:0:0a88f6e6-4ed1-4b53-88ad-3c568ca3daa8"
> > > >> transition-magic="2:1;73:4:0:0a88f6e6-4ed1-4b53-88ad-3c568ca3daa8"
> > > >> call-id="114" rc-code="1" op-status="2" interval="0"
> > > >> last-run="1496326469" last-rc-change="1496326469" exec-time="180001"
> > > >> queue-time="0" op-digest="8ec02bcea0bab86f4a7e9e27c23bc88b"/>
> > > >>
> > > >>
> > > >> Really have no clue why this isn't cleared...
> > _______________________________________________
> > Users mailing list: Users at clusterlabs.org
> > http://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
>