[ClusterLabs] clearing failed actions
Attila Megyeri
amegyeri at minerva-soft.com
Mon Jun 19 13:45:19 EDT 2017
Hi Ken,
/sorry for the long text/
I have created a relatively simple setup to localize the issue.
Three nodes, no fencing, just a master/slave mysql with two virtual IPs.
Just as a reminder, my primary issue is that at the cluster recheck interval the failcounts are not cleared.
I simulated a failure with:
crm_failcount -N ctdb1 -r db-ip-master -v 1
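(As a side note for anyone reproducing this: the injected count can be inspected and removed again with the same tool. A minimal sketch using the standard crm_failcount options, with the resource and node names from the setup above:)

```shell
# Show the current fail count for db-ip-master on node ctdb1
crm_failcount -N ctdb1 -r db-ip-master -G

# Delete the fail count attribute again (manual equivalent of expiry)
crm_failcount -N ctdb1 -r db-ip-master -D
```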
crm_mon shows:
Last updated: Mon Jun 19 17:34:35 2017
Last change: Mon Jun 19 17:34:35 2017 via cibadmin on ctmgr
Stack: corosync
Current DC: ctmgr (168362243) - partition with quorum
Version: 1.1.10-42f2063
3 Nodes configured
4 Resources configured
Online: [ ctdb1 ctdb2 ctmgr ]
db-ip-master (ocf::heartbeat:IPaddr2): Started ctdb1
db-ip-slave (ocf::heartbeat:IPaddr2): Started ctdb2
Master/Slave Set: mysql [db-mysql]
Masters: [ ctdb1 ]
Slaves: [ ctdb2 ]
Node Attributes:
* Node ctdb1:
+ master-db-mysql : 3601
+ readable : 1
* Node ctdb2:
+ master-db-mysql : 3600
+ readable : 1
* Node ctmgr:
Migration summary:
* Node ctmgr:
* Node ctdb1:
db-ip-master: migration-threshold=1000000 fail-count=1
* Node ctdb2:
When I check the pacemaker log on the DC, I see the following:
Jun 19 17:37:06 [18998] ctmgr crmd: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (30000ms)
Jun 19 17:37:06 [18998] ctmgr crmd: debug: s_crmd_fsa: Processing I_PE_CALC: [ state=S_IDLE cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jun 19 17:37:06 [18998] ctmgr crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jun 19 17:37:06 [18998] ctmgr crmd: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
Jun 19 17:37:06 [18998] ctmgr crmd: debug: do_state_transition: All 3 cluster nodes are eligible to run resources.
Jun 19 17:37:06 [18998] ctmgr crmd: debug: do_pe_invoke: Query 231: Requesting the current CIB: S_POLICY_ENGINE
Jun 19 17:37:06 [18994] ctmgr cib: info: cib_process_request: Completed cib_query operation for section 'all': OK (rc=0, origin=local/crmd/231, version=0.12.9)
Jun 19 17:37:06 [18998] ctmgr crmd: debug: do_pe_invoke_callback: Invoking the PE: query=231, ref=pe_calc-dc-1497893826-144, seq=21884, quorate=1
Jun 19 17:37:06 [18997] ctmgr pengine: info: process_pe_message: Input has not changed since last time, not saving to disk
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: STONITH timeout: 60000
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: STONITH of failed nodes is disabled
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: Stop all active resources: false
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: Default stickiness: 0
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: On loss of CCM Quorum: Stop ALL resources
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_domains: Unpacking domains
Jun 19 17:37:06 [18997] ctmgr pengine: info: determine_online_status: Node ctmgr is online
Jun 19 17:37:06 [18997] ctmgr pengine: info: determine_online_status: Node ctdb1 is online
Jun 19 17:37:06 [18997] ctmgr pengine: info: determine_online_status: Node ctdb2 is online
Jun 19 17:37:06 [18997] ctmgr pengine: debug: find_anonymous_clone: Internally renamed db-mysql on ctmgr to db-mysql:0
Jun 19 17:37:06 [18997] ctmgr pengine: debug: find_anonymous_clone: Internally renamed db-mysql on ctdb1 to db-mysql:0
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_rsc_op: db-mysql_last_failure_0 on ctdb1 returned 8 (master) instead of the expected value: 7 (not running)
Jun 19 17:37:06 [18997] ctmgr pengine: notice: unpack_rsc_op: Operation monitor found resource db-mysql:0 active in master mode on ctdb1
Jun 19 17:37:06 [18997] ctmgr pengine: debug: find_anonymous_clone: Internally renamed db-mysql on ctdb2 to db-mysql:1
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_rsc_op: db-mysql_last_failure_0 on ctdb2 returned 0 (ok) instead of the expected value: 7 (not running)
Jun 19 17:37:06 [18997] ctmgr pengine: info: unpack_rsc_op: Operation monitor found resource db-mysql:1 active on ctdb2
Jun 19 17:37:06 [18997] ctmgr pengine: info: native_print: db-ip-master (ocf::heartbeat:IPaddr2): Started ctdb1
Jun 19 17:37:06 [18997] ctmgr pengine: info: native_print: db-ip-slave (ocf::heartbeat:IPaddr2): Started ctdb2
Jun 19 17:37:06 [18997] ctmgr pengine: info: clone_print: Master/Slave Set: mysql [db-mysql]
Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_active: Resource db-mysql:0 active on ctdb1
Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_active: Resource db-mysql:0 active on ctdb1
Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_active: Resource db-mysql:1 active on ctdb2
Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_active: Resource db-mysql:1 active on ctdb2
Jun 19 17:37:06 [18997] ctmgr pengine: info: short_print: Masters: [ ctdb1 ]
Jun 19 17:37:06 [18997] ctmgr pengine: info: short_print: Slaves: [ ctdb2 ]
Jun 19 17:37:06 [18997] ctmgr pengine: info: get_failcount_full: db-ip-master has failed 1 times on ctdb1
Jun 19 17:37:06 [18997] ctmgr pengine: info: common_apply_stickiness: db-ip-master can fail 999999 more times on ctdb1 before being forced off
Jun 19 17:37:06 [18997] ctmgr pengine: debug: common_apply_stickiness: Resource db-mysql:0: preferring current location (node=ctdb1, weight=1)
Jun 19 17:37:06 [18997] ctmgr pengine: debug: common_apply_stickiness: Resource db-mysql:1: preferring current location (node=ctdb2, weight=1)
Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_assign_node: Assigning ctdb1 to db-mysql:0
Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_assign_node: Assigning ctdb2 to db-mysql:1
Jun 19 17:37:06 [18997] ctmgr pengine: debug: clone_color: Allocated 2 mysql instances of a possible 2
Jun 19 17:37:06 [18997] ctmgr pengine: debug: master_color: db-mysql:0 master score: 3601
Jun 19 17:37:06 [18997] ctmgr pengine: info: master_color: Promoting db-mysql:0 (Master ctdb1)
Jun 19 17:37:06 [18997] ctmgr pengine: debug: master_color: db-mysql:1 master score: 3600
Jun 19 17:37:06 [18997] ctmgr pengine: info: master_color: mysql: Promoted 1 instances of a possible 1 to master
Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_assign_node: Assigning ctdb1 to db-ip-master
Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_assign_node: Assigning ctdb2 to db-ip-slave
Jun 19 17:37:06 [18997] ctmgr pengine: debug: master_create_actions: Creating actions for mysql
Jun 19 17:37:06 [18997] ctmgr pengine: info: LogActions: Leave db-ip-master (Started ctdb1)
Jun 19 17:37:06 [18997] ctmgr pengine: info: LogActions: Leave db-ip-slave (Started ctdb2)
Jun 19 17:37:06 [18997] ctmgr pengine: info: LogActions: Leave db-mysql:0 (Master ctdb1)
Jun 19 17:37:06 [18997] ctmgr pengine: info: LogActions: Leave db-mysql:1 (Slave ctdb2)
Jun 19 17:37:06 [18997] ctmgr pengine: notice: process_pe_message: Calculated Transition 38: /var/lib/pacemaker/pengine/pe-input-16.bz2
Jun 19 17:37:06 [18998] ctmgr crmd: debug: s_crmd_fsa: Processing I_PE_SUCCESS: [ state=S_POLICY_ENGINE cause=C_IPC_MESSAGE origin=handle_response ]
Jun 19 17:37:06 [18998] ctmgr crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jun 19 17:37:06 [18998] ctmgr crmd: debug: unpack_graph: Unpacked transition 38: 0 actions in 0 synapses
Jun 19 17:37:06 [18998] ctmgr crmd: info: do_te_invoke: Processing graph 38 (ref=pe_calc-dc-1497893826-144) derived from /var/lib/pacemaker/pengine/pe-input-16.bz2
Jun 19 17:37:06 [18998] ctmgr crmd: debug: print_graph: Empty transition graph
Jun 19 17:37:06 [18998] ctmgr crmd: notice: run_graph: Transition 38 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-16.bz2): Complete
Jun 19 17:37:06 [18998] ctmgr crmd: debug: print_graph: Empty transition graph
Jun 19 17:37:06 [18998] ctmgr crmd: debug: te_graph_trigger: Transition 38 is now complete
Jun 19 17:37:06 [18998] ctmgr crmd: debug: notify_crmd: Processing transition completion in state S_TRANSITION_ENGINE
Jun 19 17:37:06 [18998] ctmgr crmd: debug: notify_crmd: Transition 38 status: done - <null>
Jun 19 17:37:06 [18998] ctmgr crmd: debug: s_crmd_fsa: Processing I_TE_SUCCESS: [ state=S_TRANSITION_ENGINE cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun 19 17:37:06 [18998] ctmgr crmd: info: do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
Jun 19 17:37:06 [18998] ctmgr crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun 19 17:37:06 [18998] ctmgr crmd: debug: do_state_transition: Starting PEngine Recheck Timer
Jun 19 17:37:06 [18998] ctmgr crmd: debug: crm_timer_start: Started PEngine Recheck Timer (I_PE_CALC:30000ms), src=277
As you can see from the logs, pacemaker does not even try to re-monitor the resource that failed, or at least I see no sign of it.
Cluster recheck interval is set to 30 seconds for troubleshooting reasons.
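(For completeness, the troubleshooting settings were applied roughly like this; treat the crmsh syntax as a sketch and adapt it to your crm version:)

```shell
# Short recheck interval so failcount expiry can be observed quickly
# (the default is much longer; this is only for troubleshooting)
crm configure property cluster-recheck-interval="30s"

# failure-timeout on the primitive: counts older than this should be
# expired at the next recheck
crm resource meta db-ip-master set failure-timeout 2m
```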
If I execute a
crm resource cleanup db-ip-master
The failure is removed.
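(Equivalently, the lower-level tool can target a single node explicitly, which may help when a cleanup issued from the DC seems to have no effect; this sketch uses the standard crm_resource options:)

```shell
# Clean up the resource's history and failcount only on the node
# where the failure was recorded
crm_resource --resource db-ip-master --cleanup --node ctdb1
```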
Now, am I getting something terribly wrong here?
Or is this simply a bug in 1.1.10?
Thanks,
Attila
> -----Original Message-----
> From: Ken Gaillot [mailto:kgaillot at redhat.com]
> Sent: Wednesday, June 7, 2017 10:14 PM
> To: Attila Megyeri <amegyeri at minerva-soft.com>; Cluster Labs - All topics
> related to open-source clustering welcomed <users at clusterlabs.org>
> Subject: Re: [ClusterLabs] clearing failed actions
>
> On 06/01/2017 02:44 PM, Attila Megyeri wrote:
> > Ken,
> >
> > I noticed something strange, this might be the issue.
> >
> > In some cases, even the manual cleanup does not work.
> >
> > I have a failed action of resource "A" on node "a". DC is node "b".
> >
> > e.g.
> > Failed actions:
> > jboss_imssrv1_monitor_10000 (node=ctims1, call=108, rc=1, status=complete, last-rc-change=Thu Jun 1 14:13:36 2017
> >
> >
> > When I attempt to do a "crm resource cleanup A" from node "b", nothing
> > happens. Basically the lrmd on "a" is not notified that it should
> > monitor the resource.
> >
> >
> > When I execute a "crm resource cleanup A" command on node "a" (where
> > the operation failed), the failed action is cleared properly.
> >
> > Why could this be happening?
> > Which component should be responsible for this? pengine, crmd, lrmd?
>
> The crm shell will send commands to attrd (to clear fail counts) and
> crmd (to clear the resource history), which in turn will record changes
> in the cib.
>
> I'm not sure how crm shell implements it, but crm_resource sends
> individual messages to each node when cleaning up a resource without
> specifying a particular node. You could check the pacemaker log on each
> node to see whether attrd and crmd are receiving those commands, and
> what they do in response.
>
>
> >> -----Original Message-----
> >> From: Attila Megyeri [mailto:amegyeri at minerva-soft.com]
> >> Sent: Thursday, June 1, 2017 6:57 PM
> >> To: kgaillot at redhat.com; Cluster Labs - All topics related to open-source
> >> clustering welcomed <users at clusterlabs.org>
> >> Subject: Re: [ClusterLabs] clearing failed actions
> >>
> >> thanks Ken,
> >>
> >>
> >>
> >>
> >>
> >>> -----Original Message-----
> >>> From: Ken Gaillot [mailto:kgaillot at redhat.com]
> >>> Sent: Thursday, June 1, 2017 12:04 AM
> >>> To: users at clusterlabs.org
> >>> Subject: Re: [ClusterLabs] clearing failed actions
> >>>
> >>> On 05/31/2017 12:17 PM, Ken Gaillot wrote:
> >>>> On 05/30/2017 02:50 PM, Attila Megyeri wrote:
> >>>>> Hi Ken,
> >>>>>
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Ken Gaillot [mailto:kgaillot at redhat.com]
> >>>>>> Sent: Tuesday, May 30, 2017 4:32 PM
> >>>>>> To: users at clusterlabs.org
> >>>>>> Subject: Re: [ClusterLabs] clearing failed actions
> >>>>>>
> >>>>>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Shouldn't the
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> cluster-recheck-interval="2m"
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> property instruct pacemaker to recheck the cluster every 2 minutes
> >>>>>>> and clean the failcounts?
> >>>>>>
> >>>>>> It instructs pacemaker to recalculate whether any actions need to be
> >>>>>> taken (including expiring any failcounts appropriately).
> >>>>>>
> >>>>>>> At the primitive level I also have a
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> migration-threshold="30" failure-timeout="2m"
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> but whenever I have a failure, it remains there forever.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> What could be causing this?
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> thanks,
> >>>>>>>
> >>>>>>> Attila
> >>>>>> Is it a single old failure, or a recurring failure? The failure timeout
> >>>>>> works in a somewhat nonintuitive way. Old failures are not individually
> >>>>>> expired. Instead, all failures of a resource are simultaneously cleared
> >>>>>> if all of them are older than the failure-timeout. So if something
> >>>>>> keeps failing repeatedly (more frequently than the failure-timeout),
> >>>>>> none of the failures will be cleared.
> >>>>>>
> >>>>>> If it's not a repeating failure, something odd is going on.
> >>>>>
> >>>>> It is not a repeating failure. Let's say that a resource fails for
> >>>>> whatever action. It will remain in the failed actions (crm_mon -Af)
> >>>>> until I issue a "crm resource cleanup <resource name>". Even after days
> >>>>> or weeks, even though I see in the logs that the cluster is rechecked
> >>>>> every 120 seconds.
> >>>>>
> >>>>> How could I troubleshoot this issue?
> >>>>>
> >>>>> thanks!
> >>>>
> >>>>
> >>>> Ah, I see what you're saying. That's expected behavior.
> >>>>
> >>>> The failure-timeout applies to the failure *count* (which is used for
> >>>> checking against migration-threshold), not the failure *history* (which
> >>>> is used for the status display).
> >>>>
> >>>> The idea is to have it no longer affect the cluster behavior, but still
> >>>> allow an administrator to know that it happened. That's why a manual
> >>>> cleanup is required to clear the history.
> >>>
> >>> Hmm, I'm wrong there ... failure-timeout does expire the failure history
> >>> used for status display.
> >>>
> >>> It works with the current versions. It's possible 1.1.10 had issues with
> >>> that.
> >>>
> >>
> >> Well if nothing helps I will try to upgrade to a more recent version..
> >>
> >>
> >>
> >>> Check the status to see which node is DC, and look at the pacemaker log
> >>> there after the failure occurred. There should be a message about the
> >>> failcount expiring. You can also look at the live CIB and search for
> >>> last_failure to see what is used for the display.
> >> [AM]
> >>
> >> In the pacemaker log I see at every recheck interval the following lines:
> >>
> >> Jun 01 16:54:08 [8700] ctabsws2 pengine: warning: unpack_rsc_op: Processing failed op start for jboss_admin2 on ctadmin2: unknown error (1)
> >>
> >> If I check the CIB for the failure I see:
> >>
> >> <nvpair id="status-168362322-last-failure-jboss_admin2" name="last-failure-jboss_admin2" value="1496326649"/>
> >> <lrm_rsc_op id="jboss_admin2_last_failure_0" operation_key="jboss_admin2_start_0" operation="start" crm-debug-origin="do_update_resource" crm_feature_set="3.0.7" transition-key="73:4:0:0a88f6e6-4ed1-4b53-88ad-3c568ca3daa8" transition-magic="2:1;73:4:0:0a88f6e6-4ed1-4b53-88ad-3c568ca3daa8" call-id="114" rc-code="1" op-status="2" interval="0" last-run="1496326469" last-rc-change="1496326469" exec-time="180001" queue-time="0" op-digest="8ec02bcea0bab86f4a7e9e27c23bc88b"/>
> >>
> >>
> >> Really have no clue why this isn't cleared...