[ClusterLabs] When resource fails to start it stops an apparently unrelated resource
Ken Gaillot
kgaillot at redhat.com
Wed Oct 18 10:42:44 EDT 2017
On Wed, 2017-10-18 at 14:25 +0200, Gerard Garcia wrote:
> So I think I found the problem. The two resources are named forwarder
> and bgpforwarder. It doesn't matter if bgpforwarder exists. It is
> just that when I set the failcount of a resource named bgpforwarder
> to INFINITY (crm_failcount -r bgpforwarder -v INFINITY), it directly
> affects the forwarder resource.
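>
> For reference, these are roughly the commands involved, run on the
> node in question:
>
>   crm_failcount -r bgpforwarder -v INFINITY   # set the failcount
>   crm_failcount -r bgpforwarder -G            # query the current value
>   crm_resource --cleanup -r bgpforwarder      # clear it again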
>
> If I change the name to forwarderbgp, the problem disappears. So it
> seems that the problem is that Pacemaker mixes up the bgpforwarder and
> forwarder names. Is it a bug?
>
> Gerard
That's really surprising. What version of pacemaker are you using?
There were a lot of changes in fail count handling in the last few
releases.
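
If you're not sure, something like this will tell you:

  pacemakerd --features
  # or query the package, e.g. rpm -q pacemaker / dpkg -s pacemaker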
>
> On Tue, Oct 17, 2017 at 6:27 PM, Gerard Garcia <gerard at talaia.io>
> wrote:
> > That makes sense. I've tried copying the anything resource and
> > changing its name and id (which I guess should be enough to make
> > Pacemaker treat them as different resources), but I still have the
> > same problem.
> >
> > After more debugging I have reduced the problem to this:
> > * First cloned resource running fine
> > * Second cloned resource running fine
> > * Manually set the failcount of the second cloned resource to
> > INFINITY (commands below)
> > * Pacemaker triggers a stop operation (without any monitor
> > operation failing) for both resources on the node where the
> > failcount has been set to INFINITY.
> > * Resetting the failcount starts the two resources again
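> >
> > On the affected node that is roughly (resource name here is a
> > placeholder):
> >
> >   crm_failcount -r <second-rsc> -v INFINITY   # set the failcount
> >   crm_mon -1                                  # both clones are stopped
> >   crm_failcount -r <second-rsc> -D            # reset; both start again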
> >
> > Weirdly enough, the second resource doesn't stop if I set the
> > first resource's failcount to INFINITY (not even the first
> > resource stops...).
> >
> > But:
> > * If I set globally-unique=true on the first resource, it does not
> > stop, so somehow that breaks the relation.
> > * If I manually set the failcount of the first resource to 0, that
> > also breaks the relation, so it does not stop either. It seems
> > like the failcount value is inherited from the second resource
> > when the first one has no value of its own (commands below).
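> >
> > For reference, those two workarounds look roughly like this (the
> > clone and resource ids are placeholders):
> >
> >   crm_resource -r <first-rsc-clone> --meta -p globally-unique -v true
> >   crm_failcount -r <first-rsc> -v 0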
> >
> > I must have something wrongly configured, but I can't really see
> > why there is this relationship...
> >
> > Gerard
> >
> > On Tue, Oct 17, 2017 at 3:35 PM, Ken Gaillot <kgaillot at redhat.com>
> > wrote:
> > > On Tue, 2017-10-17 at 11:47 +0200, Gerard Garcia wrote:
> > > > Thanks Ken. Yes, inspecting the logs it seems that the
> > > > failcount of the correctly running resource reaches the
> > > > maximum number of allowed failures and it gets banned on all
> > > > nodes.
> > > >
> > > > What is weird is that I only see the failcount of the first
> > > > resource getting updated; it is like the failcounts are being
> > > > mixed. In fact, when the two resources get banned, the only
> > > > way I have to make the first one start again is to disable the
> > > > failing one and clean the failcount of both resources (it is
> > > > not enough to only clean the failcount of the first resource).
> > > > Does that make sense?
> > > >
> > > > Gerard
> > >
> > > My suspicion is that you have two instances of the same service,
> > > and the resource agent monitor is only checking the general
> > > service rather than a specific instance of it, so the monitors
> > > on both of them return failure if either one is failing.
> > >
> > > That would explain why you have to disable the failing resource,
> > > so its monitor stops running. I can't think of why you'd have to
> > > clean its failcount for the other one to start, though.
> > >
> > > The "anything" agent very often causes more problems than it
> > > solves ... I'd recommend writing your own OCF agent tailored to
> > > your service. It's not much more complicated than an init
> > > script.
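> > >
> > > A bare-bones skeleton looks roughly like this (the "myservice"
> > > commands and paths are placeholders for your own service):
> > >
> > > #!/bin/sh
> > > # Minimal OCF resource agent sketch: meta-data, start, stop, monitor
> > > : ${OCF_ROOT=/usr/lib/ocf}
> > > : ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
> > > . ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
> > >
> > > meta_data() {
> > >     cat <<EOF
> > > <?xml version="1.0"?>
> > > <resource-agent name="myservice" version="0.1">
> > >   <version>1.0</version>
> > >   <shortdesc lang="en">Example agent for a custom service</shortdesc>
> > >   <longdesc lang="en">Starts, stops and monitors one specific
> > >   instance of the service.</longdesc>
> > >   <parameters/>
> > >   <actions>
> > >     <action name="start" timeout="20s"/>
> > >     <action name="stop" timeout="20s"/>
> > >     <action name="monitor" timeout="20s" interval="10s"/>
> > >     <action name="meta-data" timeout="5s"/>
> > >   </actions>
> > > </resource-agent>
> > > EOF
> > > }
> > >
> > > case "$1" in
> > >     meta-data) meta_data; exit $OCF_SUCCESS ;;
> > >     start)     /usr/local/bin/myservice start && exit $OCF_SUCCESS
> > >                exit $OCF_ERR_GENERIC ;;
> > >     stop)      /usr/local/bin/myservice stop && exit $OCF_SUCCESS
> > >                exit $OCF_ERR_GENERIC ;;
> > >     monitor)   # check this specific instance, not "the service"
> > >                /usr/local/bin/myservice status >/dev/null 2>&1 &&
> > >                    exit $OCF_SUCCESS
> > >                exit $OCF_NOT_RUNNING ;;
> > >     *)         exit $OCF_ERR_UNIMPLEMENTED ;;
> > > esac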
> > >
> > > > On Mon, Oct 16, 2017 at 6:57 PM, Ken Gaillot
> > > > <kgaillot at redhat.com> wrote:
> > > > > On Mon, 2017-10-16 at 18:30 +0200, Gerard Garcia wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I have a cluster with two ocf:heartbeat:anything resources,
> > > > > > each one running as a clone on all nodes of the cluster.
> > > > > > For some reason, when one of them fails to start the other
> > > > > > one stops. There is no constraint configured or any other
> > > > > > kind of relation between them.
> > > > > >
> > > > > > Is it possible that there is some kind of implicit relation
> > > > > > that I'm not aware of (for example because they are the
> > > > > > same type)?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Gerard
> > > > >
> > > > > There is no implicit relation on the Pacemaker side.
> > > > > However, if the agent returns "failed" for both resources
> > > > > when either one fails, you could see something like that.
> > > > > I'd look at the logs on the DC and see why it decided to
> > > > > restart the second resource.
> > > > > --
> > > > > Ken Gaillot <kgaillot at redhat.com>
> > > > >
> > > --
> > > Ken Gaillot <kgaillot at redhat.com>
> > >
> >
> >
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
--
Ken Gaillot <kgaillot at redhat.com>