[ClusterLabs] clearing failed actions

Attila Megyeri amegyeri at minerva-soft.com
Thu Jun 1 15:44:28 EDT 2017


Ken,

I noticed something strange; this might be the issue.

In some cases, even the manual cleanup does not work.

I have a failed action for resource "A" on node "a". The DC is node "b".

e.g.
	Failed actions:
    jboss_imssrv1_monitor_10000 (node=ctims1, call=108, rc=1, status=complete, last-rc-change=Thu Jun  1 14:13:36 2017)


When I attempt to run "crm resource cleanup A" from node "b", nothing happens: the lrmd on "a" is never notified that it should re-monitor the resource.


When I execute "crm resource cleanup A" on node "a" (where the operation failed), the failed action is cleared properly.

Why could this be happening?
Which component is responsible for this: the pengine, crmd, or lrmd?
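
For reference, these are the invocations I am comparing, together with what I believe are the lower-level equivalents (the resource name is taken from the failed monitor above, and the --cleanup/--resource/--node spellings are my assumption for the crm_resource shipped with this pacemaker version):

    # crmsh cleanup issued on the DC ("b"): no visible effect
    crm resource cleanup jboss_imssrv1

    # the same cleanup issued locally on node "a" (ctims1): clears the failed action
    crm resource cleanup jboss_imssrv1

    # lower-level equivalent, explicitly scoped to the node where the monitor failed
    # (assuming this crm_resource build accepts --cleanup/--resource/--node)
    crm_resource --cleanup --resource jboss_imssrv1 --node ctims1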




> -----Original Message-----
> From: Attila Megyeri [mailto:amegyeri at minerva-soft.com]
> Sent: Thursday, June 1, 2017 6:57 PM
> To: kgaillot at redhat.com; Cluster Labs - All topics related to open-source
> clustering welcomed <users at clusterlabs.org>
> Subject: Re: [ClusterLabs] clearing failed actions
> 
> Thanks, Ken,
> 
> 
> 
> 
> 
> > -----Original Message-----
> > From: Ken Gaillot [mailto:kgaillot at redhat.com]
> > Sent: Thursday, June 1, 2017 12:04 AM
> > To: users at clusterlabs.org
> > Subject: Re: [ClusterLabs] clearing failed actions
> >
> > On 05/31/2017 12:17 PM, Ken Gaillot wrote:
> > > On 05/30/2017 02:50 PM, Attila Megyeri wrote:
> > >> Hi Ken,
> > >>
> > >>
> > >>> -----Original Message-----
> > >>> From: Ken Gaillot [mailto:kgaillot at redhat.com]
> > >>> Sent: Tuesday, May 30, 2017 4:32 PM
> > >>> To: users at clusterlabs.org
> > >>> Subject: Re: [ClusterLabs] clearing failed actions
> > >>>
> > >>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
> > >>>> Hi,
> > >>>>
> > >>>>
> > >>>>
> > >>>> Shouldn't the
> > >>>>
> > >>>>
> > >>>>
> > >>>> cluster-recheck-interval="2m"
> > >>>>
> > >>>>
> > >>>>
> > >>>> property instruct pacemaker to recheck the cluster every 2 minutes
> > >>>> and clean the failcounts?
> > >>>
> > >>> It instructs pacemaker to recalculate whether any actions need to be
> > >>> taken (including expiring any failcounts appropriately).
> > >>>
> > >>>> At the primitive level I also have a
> > >>>>
> > >>>>
> > >>>>
> > >>>> migration-threshold="30" failure-timeout="2m"
> > >>>>
> > >>>>
> > >>>>
> > >>>> but whenever I have a failure, it remains there forever.
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> What could be causing this?
> > >>>>
> > >>>>
> > >>>>
> > >>>> thanks,
> > >>>>
> > >>>> Attila
> > >>> Is it a single old failure, or a recurring failure? The failure timeout
> > >>> works in a somewhat nonintuitive way. Old failures are not individually
> > >>> expired. Instead, all failures of a resource are simultaneously cleared
> > >>> if all of them are older than the failure-timeout. So if something keeps
> > >>> failing repeatedly (more frequently than the failure-timeout), none of
> > >>> the failures will be cleared.
> > >>>
> > >>> If it's not a repeating failure, something odd is going on.
> > >>
> > >> It is not a repeating failure. Let's say that a resource fails for whatever
> > >> action; it will remain in the failed actions (crm_mon -Af) until I issue a
> > >> "crm resource cleanup <resource name>", even after days or weeks, even
> > >> though I see in the logs that the cluster is rechecked every 120 seconds.
> > >>
> > >> How could I troubleshoot this issue?
> > >>
> > >> thanks!
> > >
> > >
> > > Ah, I see what you're saying. That's expected behavior.
> > >
> > > The failure-timeout applies to the failure *count* (which is used for
> > > checking against migration-threshold), not the failure *history* (which
> > > is used for the status display).
> > >
> > > The idea is to have it no longer affect the cluster behavior, but still
> > > allow an administrator to know that it happened. That's why a manual
> > > cleanup is required to clear the history.
> >
> > Hmm, I'm wrong there ... failure-timeout does expire the failure history
> > used for status display.
> >
> > It works with the current versions. It's possible 1.1.10 had issues with
> > that.
> >
> 
> Well, if nothing helps I will try to upgrade to a more recent version.
> 
> 
> 
> > Check the status to see which node is DC, and look at the pacemaker log
> > there after the failure occurred. There should be a message about the
> > failcount expiring. You can also look at the live CIB and search for
> > last_failure to see what is used for the display.
> [AM]
> 
> In the pacemaker log I see the following at every recheck interval:
> 
> Jun 01 16:54:08 [8700] ctabsws2    pengine:  warning: unpack_rsc_op: Processing failed op start for jboss_admin2 on ctadmin2: unknown error (1)
> 
> If I check the CIB for the failure, I see:
> 
> <nvpair id="status-168362322-last-failure-jboss_admin2"
>         name="last-failure-jboss_admin2" value="1496326649"/>
>             <lrm_rsc_op id="jboss_admin2_last_failure_0" operation_key="jboss_admin2_start_0"
>                         operation="start" crm-debug-origin="do_update_resource"
>                         crm_feature_set="3.0.7"
>                         transition-key="73:4:0:0a88f6e6-4ed1-4b53-88ad-3c568ca3daa8"
>                         transition-magic="2:1;73:4:0:0a88f6e6-4ed1-4b53-88ad-3c568ca3daa8"
>                         call-id="114" rc-code="1" op-status="2" interval="0"
>                         last-run="1496326469" last-rc-change="1496326469"
>                         exec-time="180001" queue-time="0"
>                         op-digest="8ec02bcea0bab86f4a7e9e27c23bc88b"/>
> 
> 
> Really have no clue why this isn't cleared...
> 
> 
> 
> >
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
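
For completeness, this is roughly how I am pulling the failcount and last_failure records out of the live CIB on the DC, as Ken suggested above (a sketch: the grep pattern is only illustrative, and the meta-attribute query flags are my assumption):

    # one-shot status including fail counts
    crm_mon -1 -f

    # dump the status section of the live CIB and look for the stale records
    cibadmin -Q -o status | grep -i -E 'fail-count|last-failure|last_failure'

    # double-check the recheck interval and the resource's failure-timeout meta attribute
    crm_attribute --type crm_config --name cluster-recheck-interval --query
    crm_resource --resource jboss_admin2 --get-parameter failure-timeout --meta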




More information about the Users mailing list