[ClusterLabs] monitor failed actions not cleared

Ken Gaillot kgaillot at redhat.com
Wed Oct 18 16:49:11 EDT 2017


On Mon, 2017-10-02 at 13:29 +0000, LE COQUIL Pierre-Yves wrote:
> Hi,
>  
> I finally found my mistake:
> I had set up the failure-timeout like the lifetime example in the
> Red Hat documentation, with the value PT1M.
> If I set the failure-timeout to 60 instead, it works as it should.

This is a bug somewhere in Pacemaker. I recently got a bug report
related to recurring monitors, so I'm taking a closer look at time
interval handling in general. I'll make sure to track this one down as
well.
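
For reference, a minimal sketch of the workaround with pcs (the
resource name is taken from this thread; ISO 8601 durations such as
PT1M should be accepted, but a plain number of seconds avoids the
parsing bug described above):

    # Set failure-timeout as a plain number of seconds instead of an
    # ISO 8601 duration (workaround for the PT1M parsing problem)
    pcs resource meta ACTIVATION_KX failure-timeout=60

    # An expired failcount is only cleared when the cluster rechecks,
    # so the cluster-recheck-interval bounds how long cleanup can take
    pcs property set cluster-recheck-interval=120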

>  
> Just one last question:
> Couldn't there be something in the log saying the value isn't in the
> right format?

Definitely, it should ... though in this case, it should parse PT1M
correctly to begin with.

> Pierre-Yves
>  
>  
> From: LE COQUIL Pierre-Yves 
> Sent: Wednesday, September 27, 2017 7:37 PM
> To: 'users at clusterlabs.org' <users at clusterlabs.org>
> Subject: RE: monitor failed actions not cleared
>  
>  
>  
> From: LE COQUIL Pierre-Yves 
> Sent: Monday, September 25, 2017 4:58 PM
> To: 'users at clusterlabs.org' <users at clusterlabs.org>
> Subject: monitor failed actions not cleared
>  
> Hi,
>  
> I'm using Pacemaker 1.1.15-11.el7_3.4 / Corosync 2.4.0-4.el7 under
> CentOS 7.3.1611
>  
> => Is this configuration too old? (yum indicates these versions are
> up to date)

No, those are recent versions. CentOS 7.4 has slightly newer versions,
but there's nothing wrong with staying on those for now.

> => Should I install more recent versions of Pacemaker and Corosync?
>  
> My question is very close to the thread “clearing failed actions”
> started by Attila Megyeri in May 2017, but that issue doesn't match
> my case.
>  
> What I want to do is:
> - run 2 systemd resources on 1 of the 2 nodes of my cluster,
> - when 1 resource fails (by killing it or by moving it), have it
> restarted on the other node, while the other resource keeps running
> on the same node.
>  
> => Is this possible with Pacemaker?
>  
> What I have done in addition to the default parameters:
> - for my resources:
>   - migration-threshold=1
>   - failure-timeout=PT1M
> - for the cluster:
>   - cluster-recheck-interval=120
>  
> I have added on-fail=restart (which is the default) to my resources'
> monitor operation.
>  
> I do not use fencing (stonith-enabled=false).
> => Is fencing compatible with my goal?

Yes, fencing should be considered a requirement for a stable cluster.

Fencing handles node-level failures rather than resource-level
failures. If a node becomes unresponsive, the rest of the cluster can't
know whether it is truly down (and thus unable to cause any conflict)
or just misbehaving (perhaps the CPU is overloaded, or a network card
failed, or ...), in which case it's not safe to recover resources
elsewhere. Fencing makes certain that it's safe.
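
As an illustration only (the device type, address, and credentials
below are hypothetical placeholders, not from this thread), enabling
fencing with pcs looks roughly like this:

    # Hypothetical IPMI fence device for one node; replace the address
    # and credentials with values for your own hardware
    pcs stonith create fence-n1 fence_ipmilan \
        pcmk_host_list="metro.cas-n1" ipaddr="10.0.0.1" \
        login="admin" passwd="secret" \
        op monitor interval=60s

    # Re-enable fencing once a working device is configured
    pcs property set stonith-enabled=true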

> What happens:
> - when I kill or move 1 resource, it is restarted on the other node
> => OK
> - the failcount is incremented to 1 for this resource => OK
> - the failcount is never cleared => NOK
>  
> => I get a warning in the log:
> “pengine:  warning: unpack_rsc_op_failure:        Processing failed
> op monitor for ACTIVATION_KX on metro.cas-n1: not running (7)”
> when my resource ACTIVATION_KX has been killed on node metro.cas-n1,
> but pcs status shows ACTIVATION_KX is started on the other node.

It's a longstanding to-do to improve this message ... it doesn't
(necessarily) mean any new failure has occurred. It just means the
policy engine is processing the resource history, which includes a
failure (which could be recent, or old). The log message will show up
every time the policy engine runs, and the failure will continue to be
displayed in the status failure history, until you clean up the
resource.
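
A minimal sketch of how to inspect and clear that history with pcs
(the resource and node names are the ones from this thread):

    # Show the current failcount for the resource on a given node
    pcs resource failcount show ACTIVATION_KX metro.cas-n1

    # Clear the failure history (and the warning above) manually
    pcs resource cleanup ACTIVATION_KX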

> => Is my resource's monitor operation badly configured? (I have
> added “requires=nothing”)

Your configuration is fine, although "requires" has no effect in a
monitor operation. It's only relevant for start and promote operations,
and even then, it's deprecated to set it in the operation configuration
... it belongs in the resource configuration now. "requires=nothing" is
highly unlikely to be what you want, though; the default is usually
sufficient.
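
If you did need to override it, a sketch of where the option belongs
now (the value shown is only illustrative; as noted above, dropping
the setting and keeping the default is usually the right choice):

    # "requires" is a resource meta-attribute now, not an operation
    # option; the value here is illustrative, not a recommendation
    pcs resource meta ACTIVATION_KX requires=quorum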

> I know that my English and my Pacemaker knowledge are not great, but
> could you please give me some explanation of this behavior that I'm
> misunderstanding?

Not at all, this was a very clear and well-thought-out post :)

> => If something is wrong with my post, just tell me (this is my
> first)
>  
> Thank you
>  
> Thanks
>  
> Pierre-Yves Le Coquil
-- 
Ken Gaillot <kgaillot at redhat.com>



