[ClusterLabs] crmsh resource failcount does not appear to work

Andrei Borzenkov arvidjaar at gmail.com
Tue Jan 2 05:38:04 UTC 2018


02.01.2018 06:48, Ken Gaillot wrote:
> On Wed, 2017-12-27 at 14:03 +0300, Andrei Borzenkov wrote:
>> On Wed, Dec 27, 2017 at 11:40 AM, Kristoffer Grönlund
>> <deceiver.g at gmail.com> wrote:
>>>
>>> Andrei Borzenkov <arvidjaar at gmail.com> writes:
>>>
>>>> As far as I can tell, pacemaker acts on fail-count attributes
>>>> qualified by operation name, while crmsh sets/queries the
>>>> unqualified attribute; I do not see any syntax in crmsh to set
>>>> the fail count for a specific operation.
>>>
>>> crmsh uses crm_attribute to get the fail count. It could be that
>>> this usage has stopped working as of 1.1.17...
>>>
>>
>> There is probably a misunderstanding. The problem is which attribute
>> is used, not how it is set.  crmsh sets (and as far as I can tell
>> always has set) an attribute named fail-count-<resource>, while
>> pacemaker internally sets and queries attributes named
>> fail-count-<resource>#<operation>.
>>
>> It is possible that this has changed in recent pacemaker versions,
>> of course ... yep, here is the crm_failcount commit that implemented
>> the new (per-operation) fail counts. Which means "crm resource
>> failcount set" without qualifying by operation is simply not valid
>> ... actually, crm_failcount will refuse to set a fail count at all
>> (it can only clear one).
> 
> Hmm, I didn't realize crm shell supported setting a fail count.
> 
> We discourage setting a fail count attribute directly as of 1.1.17, as
> having a fail count without any failed operation history or last
> failure time can be confusing to users (no failures would show up in
> status, yet failure recovery behavior would be in effect, and failure
> timeouts would not work properly).
> 
> It is possible to set the new per-operation attributes directly, if
> that capability is still desired, but I'm not sure there's a good
> reason to do so.
> 
> crm_failcount is a better choice than crm_attribute for querying and
> clearing fail count attributes, as it handles summing the per-operation
> fail counts when the resource's total fail count is desired. Clearing a
> fail count is now equivalent to crm_resource --cleanup, so it keeps the
> operation history and last failure times consistent.
> 

The problem is that neither "crm resource failcount show" nor "crm
resource failcount delete" works anymore - that is how I hit this
issue in the first place. I do not particularly care whether it is
possible to set fail counts, although I can see how that could be
useful for testing.
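
To make the mismatch concrete, a rough sketch of the two queries (not
crmsh's exact invocation; the per-operation name assumes the 10-second
monitor from my transcript below):

  # unqualified attribute that crmsh reads - stays at 0
  crm_attribute -G -l reboot -N ha1 -n fail-count-rsc_Stateful_1
  # per-operation attribute that pacemaker actually increments
  crm_attribute -G -l reboot -N ha1 -n 'fail-count-rsc_Stateful_1#monitor_10000'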

If it is decided to allow setting them, maybe crmsh could default to
the "monitor" operation if none is explicitly given - that is likely
what most users mean, since during normal operation it is recurring
monitor failures that we expect.
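
That would amount to something like this (again only a sketch, with
the 10-second monitor assumed, and Ken's caveats about setting fail
counts directly still apply):

  # write the per-operation fail count via the attribute daemon
  attrd_updater -N ha1 -n 'fail-count-rsc_Stateful_1#monitor_10000' -U 4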

Although I suppose crmsh should really be using crm_failcount, which
would make support for "set" a topic for core pacemaker.
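
For show/delete that would be roughly the following (option names as I
read the 1.1.17 crm_failcount help; the per-operation form in
particular is my assumption, not something I have tested):

  # total fail count, summed over all operations
  crm_failcount -G -r rsc_Stateful_1 -N ha1
  # fail count of the 10-second monitor only
  crm_failcount -G -r rsc_Stateful_1 -N ha1 -n monitor -I 10s
  # clear; now equivalent to crm_resource --cleanup
  crm_failcount -D -r rsc_Stateful_1 -N ha1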

> FYI the per-operation fail counts are not particularly useful now, but
> they will make future failure handling enhancements possible, e.g.
> configuring start-failure-is-fatal per resource, or ignoring a certain
> number of monitor failures before recovering while still recovering
> immediately for other operation failures.
> 
>>
>> https://github.com/ClusterLabs/pacemaker/commit/8323616179dc3f8038c6a69e7323757bd1feacb1#diff-6e58482648938fd488a920b9902daac4
>>
>>
>>>
>>> Cheers,
>>> Kristoffer
>>>
>>>>
>>>> ha1:~ # rpm -q crmsh
>>>> crmsh-4.0.0+git.1511604050.816cb0f5-1.1.noarch
>>>> ha1:~ # crm_mon -1rf
>>>> Stack: corosync
>>>> Current DC: ha2 (version 1.1.17-3.3-36d2962a8) - partition with
>>>> quorum
>>>> Last updated: Sun Dec 24 10:55:54 2017
>>>> Last change: Sun Dec 24 10:55:47 2017 by hacluster via crmd on
>>>> ha2
>>>>
>>>> 2 nodes configured
>>>> 4 resources configured
>>>>
>>>> Online: [ ha1 ha2 ]
>>>>
>>>> Full list of resources:
>>>>
>>>>  stonith-sbd  (stonith:external/sbd): Started ha1
>>>>  rsc_dummy_1  (ocf::pacemaker:Dummy): Started ha2
>>>>  Master/Slave Set: ms_Stateful_1 [rsc_Stateful_1]
>>>>      Masters: [ ha1 ]
>>>>      Slaves: [ ha2 ]
>>>>
>>>> Migration Summary:
>>>> * Node ha2:
>>>> * Node ha1:
>>>> ha1:~ # echo xxx > /run/Stateful-rsc_Stateful_1.state
>>>> ha1:~ # crm_failcount -G -r rsc_Stateful_1
>>>> scope=status  name=fail-count-rsc_Stateful_1 value=1
>>>> ha1:~ # crm resource failcount rsc_Stateful_1 show ha1
>>>> scope=status  name=fail-count-rsc_Stateful_1 value=0
>>>> ha1:~ # crm resource failcount rsc_Stateful_1 set ha1 4
>>>> ha1:~ # crm_failcount -G -r rsc_Stateful_1
>>>> scope=status  name=fail-count-rsc_Stateful_1 value=1
>>>> ha1:~ # crm resource failcount rsc_Stateful_1 show ha1
>>>> scope=status  name=fail-count-rsc_Stateful_1 value=4
>>>> ha1:~ # cibadmin -Q | grep fail-count
>>>>   <nvpair id="status-1084752129-fail-count-rsc_Stateful_1.monitor_10000"
>>>>           name="fail-count-rsc_Stateful_1#monitor_10000" value="1"/>
>>>>   <nvpair id="status-1084752129-fail-count-rsc_Stateful_1"
>>>>           name="fail-count-rsc_Stateful_1" value="4"/>
>>>> ha1:~ #
>>>>
>>>
>>> --
>>> // Kristoffer Grönlund
>>> // kgronlund at suse.com
>>




