[ClusterLabs] clearing failed actions
Attila Megyeri
amegyeri at minerva-soft.com
Mon Jun 19 13:45:19 EDT 2017
Hi Ken,
/sorry for the long text/
I have created a relatively simple setup to localize the issue.
Three nodes, no fencing, just a master/slave mysql with two virtual IPs.
Just as a reminder, my primary issue is that at the cluster recheck interval the failcounts are not cleared.
I simulated a failure with:
crm_failcount -N ctdb1 -r db-ip-master -v 1
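(As a side note for anyone reproducing this: the injected count can be inspected and removed again with the same tool. A minimal sketch using the standard crm_failcount options, with the resource and node names from the setup above:)

```shell
# Show the current fail count for db-ip-master on node ctdb1
crm_failcount -N ctdb1 -r db-ip-master -G

# Delete the fail count attribute again (manual equivalent of expiry)
crm_failcount -N ctdb1 -r db-ip-master -D
```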
crm_mon shows:
Last updated: Mon Jun 19 17:34:35 2017
Last change: Mon Jun 19 17:34:35 2017 via cibadmin on ctmgr
Stack: corosync
Current DC: ctmgr (168362243) - partition with quorum
Version: 1.1.10-42f2063
3 Nodes configured
4 Resources configured
Online: [ ctdb1 ctdb2 ctmgr ]
db-ip-master (ocf::heartbeat:IPaddr2): Started ctdb1
db-ip-slave (ocf::heartbeat:IPaddr2): Started ctdb2
Master/Slave Set: mysql [db-mysql]
Masters: [ ctdb1 ]
Slaves: [ ctdb2 ]
Node Attributes:
* Node ctdb1:
+ master-db-mysql : 3601
+ readable : 1
* Node ctdb2:
+ master-db-mysql : 3600
+ readable : 1
* Node ctmgr:
Migration summary:
* Node ctmgr:
* Node ctdb1:
db-ip-master: migration-threshold=1000000 fail-count=1
* Node ctdb2:
When I check the pacemaker log on the DC, I see the following:
Jun 19 17:37:06 [18998] ctmgr crmd: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (30000ms)
Jun 19 17:37:06 [18998] ctmgr crmd: debug: s_crmd_fsa: Processing I_PE_CALC: [ state=S_IDLE cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jun 19 17:37:06 [18998] ctmgr crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jun 19 17:37:06 [18998] ctmgr crmd: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
Jun 19 17:37:06 [18998] ctmgr crmd: debug: do_state_transition: All 3 cluster nodes are eligible to run resources.
Jun 19 17:37:06 [18998] ctmgr crmd: debug: do_pe_invoke: Query 231: Requesting the current CIB: S_POLICY_ENGINE
Jun 19 17:37:06 [18994] ctmgr cib: info: cib_process_request: Completed cib_query operation for section 'all': OK (rc=0, origin=local/crmd/231, version=0.12.9)
Jun 19 17:37:06 [18998] ctmgr crmd: debug: do_pe_invoke_callback: Invoking the PE: query=231, ref=pe_calc-dc-1497893826-144, seq=21884, quorate=1
Jun 19 17:37:06 [18997] ctmgr pengine: info: process_pe_message: Input has not changed since last time, not saving to disk
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: STONITH timeout: 60000
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: STONITH of failed nodes is disabled
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: Stop all active resources: false
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: Default stickiness: 0
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: On loss of CCM Quorum: Stop ALL resources
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_domains: Unpacking domains
Jun 19 17:37:06 [18997] ctmgr pengine: info: determine_online_status: Node ctmgr is online
Jun 19 17:37:06 [18997] ctmgr pengine: info: determine_online_status: Node ctdb1 is online
Jun 19 17:37:06 [18997] ctmgr pengine: info: determine_online_status: Node ctdb2 is online
Jun 19 17:37:06 [18997] ctmgr pengine: debug: find_anonymous_clone: Internally renamed db-mysql on ctmgr to db-mysql:0
Jun 19 17:37:06 [18997] ctmgr pengine: debug: find_anonymous_clone: Internally renamed db-mysql on ctdb1 to db-mysql:0
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_rsc_op: db-mysql_last_failure_0 on ctdb1 returned 8 (master) instead of the expected value: 7 (not running)
Jun 19 17:37:06 [18997] ctmgr pengine: notice: unpack_rsc_op: Operation monitor found resource db-mysql:0 active in master mode on ctdb1
Jun 19 17:37:06 [18997] ctmgr pengine: debug: find_anonymous_clone: Internally renamed db-mysql on ctdb2 to db-mysql:1
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_rsc_op: db-mysql_last_failure_0 on ctdb2 returned 0 (ok) instead of the expected value: 7 (not running)
Jun 19 17:37:06 [18997] ctmgr pengine: info: unpack_rsc_op: Operation monitor found resource db-mysql:1 active on ctdb2
Jun 19 17:37:06 [18997] ctmgr pengine: info: native_print: db-ip-master (ocf::heartbeat:IPaddr2): Started ctdb1
Jun 19 17:37:06 [18997] ctmgr pengine: info: native_print: db-ip-slave (ocf::heartbeat:IPaddr2): Started ctdb2
Jun 19 17:37:06 [18997] ctmgr pengine: info: clone_print: Master/Slave Set: mysql [db-mysql]
Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_active: Resource db-mysql:0 active on ctdb1
Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_active: Resource db-mysql:0 active on ctdb1
Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_active: Resource db-mysql:1 active on ctdb2
Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_active: Resource db-mysql:1 active on ctdb2
Jun 19 17:37:06 [18997] ctmgr pengine: info: short_print: Masters: [ ctdb1 ]
Jun 19 17:37:06 [18997] ctmgr pengine: info: short_print: Slaves: [ ctdb2 ]
Jun 19 17:37:06 [18997] ctmgr pengine: info: get_failcount_full: db-ip-master has failed 1 times on ctdb1
Jun 19 17:37:06 [18997] ctmgr pengine: info: common_apply_stickiness: db-ip-master can fail 999999 more times on ctdb1 before being forced off
Jun 19 17:37:06 [18997] ctmgr pengine: debug: common_apply_stickiness: Resource db-mysql:0: preferring current location (node=ctdb1, weight=1)
Jun 19 17:37:06 [18997] ctmgr pengine: debug: common_apply_stickiness: Resource db-mysql:1: preferring current location (node=ctdb2, weight=1)
Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_assign_node: Assigning ctdb1 to db-mysql:0
Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_assign_node: Assigning ctdb2 to db-mysql:1
Jun 19 17:37:06 [18997] ctmgr pengine: debug: clone_color: Allocated 2 mysql instances of a possible 2
Jun 19 17:37:06 [18997] ctmgr pengine: debug: master_color: db-mysql:0 master score: 3601
Jun 19 17:37:06 [18997] ctmgr pengine: info: master_color: Promoting db-mysql:0 (Master ctdb1)
Jun 19 17:37:06 [18997] ctmgr pengine: debug: master_color: db-mysql:1 master score: 3600
Jun 19 17:37:06 [18997] ctmgr pengine: info: master_color: mysql: Promoted 1 instances of a possible 1 to master
Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_assign_node: Assigning ctdb1 to db-ip-master
Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_assign_node: Assigning ctdb2 to db-ip-slave
Jun 19 17:37:06 [18997] ctmgr pengine: debug: master_create_actions: Creating actions for mysql
Jun 19 17:37:06 [18997] ctmgr pengine: info: LogActions: Leave db-ip-master (Started ctdb1)
Jun 19 17:37:06 [18997] ctmgr pengine: info: LogActions: Leave db-ip-slave (Started ctdb2)
Jun 19 17:37:06 [18997] ctmgr pengine: info: LogActions: Leave db-mysql:0 (Master ctdb1)
Jun 19 17:37:06 [18997] ctmgr pengine: info: LogActions: Leave db-mysql:1 (Slave ctdb2)
Jun 19 17:37:06 [18997] ctmgr pengine: notice: process_pe_message: Calculated Transition 38: /var/lib/pacemaker/pengine/pe-input-16.bz2
Jun 19 17:37:06 [18998] ctmgr crmd: debug: s_crmd_fsa: Processing I_PE_SUCCESS: [ state=S_POLICY_ENGINE cause=C_IPC_MESSAGE origin=handle_response ]
Jun 19 17:37:06 [18998] ctmgr crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jun 19 17:37:06 [18998] ctmgr crmd: debug: unpack_graph: Unpacked transition 38: 0 actions in 0 synapses
Jun 19 17:37:06 [18998] ctmgr crmd: info: do_te_invoke: Processing graph 38 (ref=pe_calc-dc-1497893826-144) derived from /var/lib/pacemaker/pengine/pe-input-16.bz2
Jun 19 17:37:06 [18998] ctmgr crmd: debug: print_graph: Empty transition graph
Jun 19 17:37:06 [18998] ctmgr crmd: notice: run_graph: Transition 38 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-16.bz2): Complete
Jun 19 17:37:06 [18998] ctmgr crmd: debug: print_graph: Empty transition graph
Jun 19 17:37:06 [18998] ctmgr crmd: debug: te_graph_trigger: Transition 38 is now complete
Jun 19 17:37:06 [18998] ctmgr crmd: debug: notify_crmd: Processing transition completion in state S_TRANSITION_ENGINE
Jun 19 17:37:06 [18998] ctmgr crmd: debug: notify_crmd: Transition 38 status: done - <null>
Jun 19 17:37:06 [18998] ctmgr crmd: debug: s_crmd_fsa: Processing I_TE_SUCCESS: [ state=S_TRANSITION_ENGINE cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun 19 17:37:06 [18998] ctmgr crmd: info: do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
Jun 19 17:37:06 [18998] ctmgr crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun 19 17:37:06 [18998] ctmgr crmd: debug: do_state_transition: Starting PEngine Recheck Timer
Jun 19 17:37:06 [18998] ctmgr crmd: debug: crm_timer_start: Started PEngine Recheck Timer (I_PE_CALC:30000ms), src=277
As you can see from the logs, pacemaker does not even try to re-monitor the resource that failed, or at least I see no sign of it.
Cluster recheck interval is set to 30 seconds for troubleshooting reasons.
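(For completeness, the troubleshooting settings were applied roughly like this; treat the crmsh syntax as a sketch and adapt it to your crm version:)

```shell
# Short recheck interval so failcount expiry can be observed quickly
# (the default is much longer; this is only for troubleshooting)
crm configure property cluster-recheck-interval="30s"

# failure-timeout on the primitive: counts older than this should be
# expired at the next recheck
crm resource meta db-ip-master set failure-timeout 2m
```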
If I execute a
crm resource cleanup db-ip-master
The failure is removed.
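(Equivalently, the lower-level tool can target a single node explicitly, which may help when a cleanup issued from the DC seems to have no effect; this sketch uses the standard crm_resource options:)

```shell
# Clean up the resource's history and failcount only on the node
# where the failure was recorded
crm_resource --resource db-ip-master --cleanup --node ctdb1
```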
Now, am I getting something terribly wrong here?
Or is this simply a bug in 1.1.10?
Thanks,
Attila
> -----Original Message-----
> From: Ken Gaillot [mailto:kgaillot at redhat.com]
> Sent: Wednesday, June 7, 2017 10:14 PM
> To: Attila Megyeri <amegyeri at minerva-soft.com>; Cluster Labs - All topics
> related to open-source clustering welcomed <users at clusterlabs.org>
> Subject: Re: [ClusterLabs] clearing failed actions
>
> On 06/01/2017 02:44 PM, Attila Megyeri wrote:
> > Ken,
> >
> > I noticed something strange, this might be the issue.
> >
> > In some cases, even the manual cleanup does not work.
> >
> > I have a failed action of resource "A" on node "a". DC is node "b".
> >
> > e.g.
> > Failed actions:
> > jboss_imssrv1_monitor_10000 (node=ctims1, call=108, rc=1, status=complete, last-rc-change=Thu Jun 1 14:13:36 2017
> >
> >
> > When I attempt to do a "crm resource cleanup A" from node "b", nothing
> > happens. Basically the lrmd on "a" is not notified that it should
> > monitor the resource.
> >
> >
> > When I execute a "crm resource cleanup A" command on node "a" (where
> > the operation failed), the failed action is cleared properly.
> >
> > Why could this be happening?
> > Which component should be responsible for this? pengine, crmd, lrmd?
>
> The crm shell will send commands to attrd (to clear fail counts) and
> crmd (to clear the resource history), which in turn will record changes
> in the cib.
>
> I'm not sure how crm shell implements it, but crm_resource sends
> individual messages to each node when cleaning up a resource without
> specifying a particular node. You could check the pacemaker log on each
> node to see whether attrd and crmd are receiving those commands, and
> what they do in response.
>
>
> >> -----Original Message-----
> >> From: Attila Megyeri [mailto:amegyeri at minerva-soft.com]
> >> Sent: Thursday, June 1, 2017 6:57 PM
> >> To: kgaillot at redhat.com; Cluster Labs - All topics related to open-source
> >> clustering welcomed <users at clusterlabs.org>
> >> Subject: Re: [ClusterLabs] clearing failed actions
> >>
> >> thanks Ken,
> >>
> >>
> >>
> >>
> >>
> >>> -----Original Message-----
> >>> From: Ken Gaillot [mailto:kgaillot at redhat.com]
> >>> Sent: Thursday, June 1, 2017 12:04 AM
> >>> To: users at clusterlabs.org
> >>> Subject: Re: [ClusterLabs] clearing failed actions
> >>>
> >>> On 05/31/2017 12:17 PM, Ken Gaillot wrote:
> >>>> On 05/30/2017 02:50 PM, Attila Megyeri wrote:
> >>>>> Hi Ken,
> >>>>>
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Ken Gaillot [mailto:kgaillot at redhat.com]
> >>>>>> Sent: Tuesday, May 30, 2017 4:32 PM
> >>>>>> To: users at clusterlabs.org
> >>>>>> Subject: Re: [ClusterLabs] clearing failed actions
> >>>>>>
> >>>>>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Shouldn't the
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> cluster-recheck-interval="2m"
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> property instruct pacemaker to recheck the cluster every 2 minutes
> >>>>>>> and clean the failcounts?
> >>>>>>
> >>>>>> It instructs pacemaker to recalculate whether any actions need to be
> >>>>>> taken (including expiring any failcounts appropriately).
> >>>>>>
> >>>>>>> At the primitive level I also have a
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> migration-threshold="30" failure-timeout="2m"
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> but whenever I have a failure, it remains there forever.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> What could be causing this?
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> thanks,
> >>>>>>>
> >>>>>>> Attila
> >>>>>> Is it a single old failure, or a recurring failure? The failure timeout
> >>>>>> works in a somewhat nonintuitive way. Old failures are not individually
> >>>>>> expired. Instead, all failures of a resource are simultaneously cleared
> >>>>>> if all of them are older than the failure-timeout. So if something
> >>>>>> keeps failing repeatedly (more frequently than the failure-timeout),
> >>>>>> none of the failures will be cleared.
> >>>>>>
> >>>>>> If it's not a repeating failure, something odd is going on.
> >>>>>
> >>>>> It is not a repeating failure. Let's say that a resource fails for
> >>>>> whatever action. It will remain in the failed actions (crm_mon -Af)
> >>>>> until I issue a "crm resource cleanup <resource name>". Even after days
> >>>>> or weeks, even though I see in the logs that the cluster is rechecked
> >>>>> every 120 seconds.
> >>>>>
> >>>>> How could I troubleshoot this issue?
> >>>>>
> >>>>> thanks!
> >>>>
> >>>>
> >>>> Ah, I see what you're saying. That's expected behavior.
> >>>>
> >>>> The failure-timeout applies to the failure *count* (which is used for
> >>>> checking against migration-threshold), not the failure *history* (which
> >>>> is used for the status display).
> >>>>
> >>>> The idea is to have it no longer affect the cluster behavior, but still
> >>>> allow an administrator to know that it happened. That's why a manual
> >>>> cleanup is required to clear the history.
> >>>
> >>> Hmm, I'm wrong there ... failure-timeout does expire the failure history
> >>> used for status display.
> >>>
> >>> It works with the current versions. It's possible 1.1.10 had issues with
> >>> that.
> >>>
> >>
> >> Well if nothing helps I will try to upgrade to a more recent version..
> >>
> >>
> >>
> >>> Check the status to see which node is DC, and look at the pacemaker log
> >>> there after the failure occurred. There should be a message about the
> >>> failcount expiring. You can also look at the live CIB and search for
> >>> last_failure to see what is used for the display.
> >> [AM]
> >>
> >> In the pacemaker log I see at every recheck interval the following lines:
> >>
> >> Jun 01 16:54:08 [8700] ctabsws2 pengine: warning: unpack_rsc_op: Processing failed op start for jboss_admin2 on ctadmin2: unknown error (1)
> >>
> >> If I check the CIB for the failure I see:
> >>
> >> <nvpair id="status-168362322-last-failure-jboss_admin2" name="last-failure-jboss_admin2" value="1496326649"/>
> >> <lrm_rsc_op id="jboss_admin2_last_failure_0" operation_key="jboss_admin2_start_0" operation="start" crm-debug-origin="do_update_resource" crm_feature_set="3.0.7" transition-key="73:4:0:0a88f6e6-4ed1-4b53-88ad-3c568ca3daa8" transition-magic="2:1;73:4:0:0a88f6e6-4ed1-4b53-88ad-3c568ca3daa8" call-id="114" rc-code="1" op-status="2" interval="0" last-run="1496326469" last-rc-change="1496326469" exec-time="180001" queue-time="0" op-digest="8ec02bcea0bab86f4a7e9e27c23bc88b"/>
> >>
> >>
> >> Really have no clue why this isn't cleared...