[ClusterLabs] crmsh resource failcount does not appear to work

Ken Gaillot kgaillot at redhat.com
Mon Jan 1 22:48:08 EST 2018


On Wed, 2017-12-27 at 14:03 +0300, Andrei Borzenkov wrote:
> On Wed, Dec 27, 2017 at 11:40 AM, Kristoffer Grönlund
> <deceiver.g at gmail.com> wrote:
> > 
> > Andrei Borzenkov <arvidjaar at gmail.com> writes:
> > 
> > > As far as I can tell, pacemaker acts on failcount attributes
> > > qualified
> > > by operation name, while crm sets/queries unqualified attribute;
> > > I do
> > > not see any syntax to set fail-count for specific operation in
> > > crmsh.
> > 
> > crmsh uses crm_attribute to get the failcount. It could be that this
> > usage has stopped working as of 1.1.17.
> > 
> 
> There is probably a misunderstanding. The problem is which attribute
> is used, not how it is set. crmsh sets (and as far as I can tell has
> always set) an attribute named fail-count-<resource>, while pacemaker
> internally sets and queries attributes named
> fail-count-<resource>#<operation>.
> 
> It is possible that this has changed in recent pacemaker versions, of
> course ... yep, here is the crm_failcount commit that implemented the
> new (per-operation) fail counts. That means "crm resource failcount
> set" without qualifying by operation is simply not valid ... in fact,
> crm_failcount will refuse to set a fail count at all (it can only
> clear it).

Hmm, I didn't realize crm shell supported setting a fail count.

We discourage setting a fail count attribute directly as of 1.1.17, as
having a fail count without any failed operation history or last
failure time can be confusing to users (no failures would show up in
status, yet failure recovery behavior would be in effect, and failure
timeouts would not work properly).

It is possible to set the new per-operation attributes directly, if
that capability is still desired, but I'm not sure there's a good
reason to do so.
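If someone did still want to do that, one approach would be to write the
qualified attribute with attrd_updater. The following Python sketch just
builds the attribute name and command line; the resource, operation, and
node names are examples taken from the transcript below, and the
fail-count-<resource>#<operation>_<interval-in-ms> naming convention is
the one introduced in 1.1.17.

```python
# Hypothetical sketch: constructing an attrd_updater invocation that
# would set a per-operation fail count directly. Names are examples,
# not a recommendation to actually do this.

def per_op_failcount_attr(resource, operation, interval_ms):
    """Return the qualified per-operation fail-count attribute name."""
    return "fail-count-%s#%s_%d" % (resource, operation, interval_ms)

def attrd_updater_cmd(resource, operation, interval_ms, value, node):
    """Command line (as an argv list) that would set the attribute."""
    return [
        "attrd_updater",
        "-n", per_op_failcount_attr(resource, operation, interval_ms),
        "-U", str(value),
        "-N", node,
    ]

print(attrd_updater_cmd("rsc_Stateful_1", "monitor", 10000, 4, "ha1"))
```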

crm_failcount is a better choice than crm_attribute for querying and
clearing fail count attributes, as it will handle summing per-operation 
fail counts if a resource total fail count is desired. Clearing a fail
count is now equivalent to crm_resource --cleanup, so it keeps the
operation history and last failure times consistent.
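To illustrate the summing behavior, here is a minimal Python sketch
(this is not crm_failcount's actual implementation, just the idea) that
totals per-operation fail counts into a per-resource value; the
attribute names follow the fail-count-<resource>#<operation>_<interval>
convention seen in the cibadmin output below:

```python
# Rough sketch of summing per-operation fail counts into a resource
# total, the way crm_failcount -G does when queried for the resource
# as a whole. Input is a dict of status-section nvpairs (name -> value).
from collections import defaultdict

def total_failcounts(nvpairs):
    """Sum fail-count-<rsc>#<op>_<interval> attributes per resource."""
    totals = defaultdict(int)
    for name, value in nvpairs.items():
        if not name.startswith("fail-count-"):
            continue
        # Strip the prefix, then drop the #<operation>_<interval> suffix
        rsc = name[len("fail-count-"):].split("#", 1)[0]
        totals[rsc] += int(value)
    return dict(totals)

status = {
    "fail-count-rsc_Stateful_1#monitor_10000": "1",
    "fail-count-rsc_Stateful_1#start_0": "2",
}
print(total_failcounts(status))  # {'rsc_Stateful_1': 3}
```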

FYI the per-operation fail counts are not particularly useful now, but
they will make future failure handling enhancements possible, e.g.
configuring start-failure-is-fatal per resource, or ignoring a certain
number of monitor failures before recovering while still recovering
immediately for other operation failures.

> 
> https://github.com/ClusterLabs/pacemaker/commit/8323616179dc3f8038c6a69e7323757bd1feacb1#diff-6e58482648938fd488a920b9902daac4
> 
> 
> > 
> > Cheers,
> > Kristoffer
> > 
> > > 
> > > ha1:~ # rpm -q crmsh
> > > crmsh-4.0.0+git.1511604050.816cb0f5-1.1.noarch
> > > ha1:~ # crm_mon -1rf
> > > Stack: corosync
> > > Current DC: ha2 (version 1.1.17-3.3-36d2962a8) - partition with quorum
> > > Last updated: Sun Dec 24 10:55:54 2017
> > > Last change: Sun Dec 24 10:55:47 2017 by hacluster via crmd on ha2
> > > 
> > > 2 nodes configured
> > > 4 resources configured
> > > 
> > > Online: [ ha1 ha2 ]
> > > 
> > > Full list of resources:
> > > 
> > >  stonith-sbd  (stonith:external/sbd): Started ha1
> > >  rsc_dummy_1  (ocf::pacemaker:Dummy): Started ha2
> > >  Master/Slave Set: ms_Stateful_1 [rsc_Stateful_1]
> > >      Masters: [ ha1 ]
> > >      Slaves: [ ha2 ]
> > > 
> > > Migration Summary:
> > > * Node ha2:
> > > * Node ha1:
> > > ha1:~ # echo xxx > /run/Stateful-rsc_Stateful_1.state
> > > ha1:~ # crm_failcount -G -r rsc_Stateful_1
> > > scope=status  name=fail-count-rsc_Stateful_1 value=1
> > > ha1:~ # crm resource failcount rsc_Stateful_1 show ha1
> > > scope=status  name=fail-count-rsc_Stateful_1 value=0
> > > ha1:~ # crm resource failcount rsc_Stateful_1 set ha1 4
> > > ha1:~ # crm_failcount -G -r rsc_Stateful_1
> > > scope=status  name=fail-count-rsc_Stateful_1 value=1
> > > ha1:~ # crm resource failcount rsc_Stateful_1 show ha1
> > > scope=status  name=fail-count-rsc_Stateful_1 value=4
> > > ha1:~ # cibadmin -Q | grep fail-count
> > >           <nvpair id="status-1084752129-fail-count-rsc_Stateful_1.monitor_10000" name="fail-count-rsc_Stateful_1#monitor_10000" value="1"/>
> > >           <nvpair id="status-1084752129-fail-count-rsc_Stateful_1" name="fail-count-rsc_Stateful_1" value="4"/>
> > > ha1:~ #
> > > 
> > > _______________________________________________
> > > Users mailing list: Users at clusterlabs.org
> > > http://lists.clusterlabs.org/mailman/listinfo/users
> > > 
> > > Project Home: http://www.clusterlabs.org
> > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > Bugs: http://bugs.clusterlabs.org
> > > 
> > 
> > --
> > // Kristoffer Grönlund
> > // kgronlund at suse.com
> 
-- 
Ken Gaillot <kgaillot at redhat.com>
