[ClusterLabs] When resource fails to start it stops an apparently unrelated resource

Ken Gaillot kgaillot at redhat.com
Wed Oct 18 18:04:56 EDT 2017


On Wed, 2017-10-18 at 16:58 +0200, Gerard Garcia wrote:
> I'm using version 1.1.15-11.el7_3.2-e174ec8. As far as I know, that's
> the latest stable version in CentOS 7.3.
> 
> Gerard

Interesting ... this was an undetected bug that was coincidentally
fixed by the recent fail-count work released in 1.1.17. The bug only
affected cloned resources where one clone's name ended with the
other's.
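Sketching the failure mode Ken describes (purely illustrative shell; the attribute names follow the 1.1.x `fail-count-<resource>` convention, but the matching logic is a stand-in for, not a copy of, the actual Pacemaker code): a suffix match on the resource name also swallows any resource whose name ends with it.

```shell
# Hypothetical illustration: a suffix match on "forwarder" also catches
# fail-count-bgpforwarder, so "forwarder" picks up its neighbor's failures.
matches=""
for attr in fail-count-forwarder fail-count-bgpforwarder; do
    case "$attr" in
        *forwarder) matches="$matches $attr" ;;
    esac
done
echo "counted against 'forwarder':$matches"
```

This also fits why renaming the resource to forwarderbgp made the symptom vanish: that name no longer ends in "forwarder".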

FYI, CentOS 7.4 has 1.1.16, but that won't help this issue.

> 
> On Wed, Oct 18, 2017 at 4:42 PM, Ken Gaillot <kgaillot at redhat.com>
> wrote:
> > On Wed, 2017-10-18 at 14:25 +0200, Gerard Garcia wrote:
> > > So I think I found the problem. The two resources are named
> > > forwarder and bgpforwarder. It doesn't matter whether bgpforwarder
> > > exists: just setting the failcount to INFINITY for a resource
> > > named bgpforwarder (crm_failcount -r bgpforwarder -v INFINITY)
> > > directly affects the forwarder resource.
> > >
> > > If I change the name to forwarderbgp, the problem disappears. So
> > > it seems that Pacemaker mixes up the bgpforwarder and forwarder
> > > names. Is it a bug?
> > >
> > > Gerard
> > 
> > That's really surprising. What version of pacemaker are you using?
> > There were a lot of changes in fail count handling in the last few
> > releases.
> > 
> > >
> > > On Tue, Oct 17, 2017 at 6:27 PM, Gerard Garcia <gerard at talaia.io>
> > > wrote:
> > > > That makes sense. I've tried copying the anything resource and
> > > > changing its name and id (which I guess should be enough to make
> > > > pacemaker think they are different), but I still have the same
> > > > problem.
> > > >
> > > > After more debugging I have reduced the problem to this:
> > > > * First cloned resource running fine
> > > > * Second cloned resource running fine
> > > > * Manually set the failcount to INFINITY for the second cloned
> > > > resource
> > > > * Pacemaker triggers a stop operation (without the monitor
> > > > operation failing) for both resources on the node where the
> > > > failcount has been set to INFINITY
> > > > * Resetting the failcount starts the two resources again
> > > >
> > > > Weirdly enough, the second resource doesn't stop if I set the
> > > > first resource's failcount to INFINITY (not even the first
> > > > resource stops...).
> > > >
> > > > But:
> > > > * If I set the first resource to globally-unique=true it does
> > > > not stop, so somehow that breaks the relation.
> > > > * If I manually set the failcount to 0 on the first resource,
> > > > that also breaks the relation, so it does not stop either. It
> > > > seems like the failcount value is being inherited from the
> > > > second resource when it does not have a value of its own.
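Gerard's observation fits a simple model (hypothetical code; only the attribute naming, `fail-count-<resource>` in 1.1.x, comes from Pacemaker): each clone's fail count is a separate transient node attribute, an exact-name lookup keeps them apart, and an unset attribute is treated as zero.

```shell
# Illustrative attribute store; real fail counts live in the CIB status
# section as transient node attributes (e.g. fail-count-bgpforwarder).
store="fail-count-bgpforwarder=INFINITY"

lookup() {
    # An exact attribute-name match keeps each clone's count separate.
    for kv in $store; do
        case "$kv" in
            "fail-count-$1="*) echo "${kv#*=}"; return ;;
        esac
    done
    echo 0   # no attribute set yet: treated as zero failures
}

lookup forwarder      # prints 0 (no conflation with bgpforwarder)
lookup bgpforwarder   # prints INFINITY
```

Under this model, writing any explicit value for forwarder, even 0, gives it its own attribute, which matches the "breaks the relation" behavior described above.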
> > > >
> > > > I must have misconfigured something, but I can't really see why
> > > > this relationship exists...
> > > >
> > > > Gerard
> > > >
> > > > On Tue, Oct 17, 2017 at 3:35 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
> > > > > On Tue, 2017-10-17 at 11:47 +0200, Gerard Garcia wrote:
> > > > > > Thanks Ken. Yes, inspecting the logs it seems that the
> > > > > > failcount of the correctly running resource reaches the
> > > > > > maximum number of allowed failures and it gets banned on all
> > > > > > nodes.
> > > > > >
> > > > > > What is weird is that I only see the failcount for the first
> > > > > > resource being updated; it is as if the failcounts are being
> > > > > > mixed. In fact, when the two resources get banned, the only
> > > > > > way I can make the first one start is to disable the failing
> > > > > > one and clean the failcount of both resources (it is not
> > > > > > enough to only clean the failcount of the first resource).
> > > > > > Does that make sense?
> > > > > >
> > > > > > Gerard
> > > > >
> > > > > My suspicion is that you have two instances of the same
> > > > > service, and the resource agent's monitor is only checking the
> > > > > general service, rather than a specific instance of it, so the
> > > > > monitors on both of them return failure if either one is
> > > > > failing.
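That failure mode is easy to simulate (the process names below are made up to match this thread, and ocf:heartbeat:anything's real monitor logic may differ): a monitor that matches loosely on a process name sees both instances, while one anchored to the exact binary path sees only its own.

```shell
# Simulated process list (hypothetical commands, one per resource).
ps_out="1234 /usr/bin/forwarder --id 1
5678 /usr/bin/bgpforwarder --id 2"

# A loose match on "forwarder" counts both processes:
echo "$ps_out" | grep -c "forwarder"              # prints 2
# Anchoring on the exact binary path counts only its own instance:
echo "$ps_out" | grep -c "/usr/bin/forwarder "    # prints 1
```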
> > > > >
> > > > > That would explain why you have to disable the failing
> > > > > resource, so its monitor stops running. I can't think of why
> > > > > you'd have to clean its failcount for the other one to start,
> > > > > though.
> > > > >
> > > > > The "anything" agent very often causes more problems than it
> > > > > solves ... I'd recommend writing your own OCF agent tailored
> > > > > to your service. It's not much more complicated than an init
> > > > > script.
> > > > >
> > > > > > On Mon, Oct 16, 2017 at 6:57 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
> > > > > > > On Mon, 2017-10-16 at 18:30 +0200, Gerard Garcia wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I have a cluster with two ocf:heartbeat:anything
> > > > > > > > resources, each one running as a clone on all nodes of
> > > > > > > > the cluster. For some reason, when one of them fails to
> > > > > > > > start, the other one stops. There is no constraint
> > > > > > > > configured or any kind of relation between them.
> > > > > > > >
> > > > > > > > Is it possible that there is some kind of implicit
> > > > > > > > relation that I'm not aware of (for example because they
> > > > > > > > are the same type)?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Gerard
> > > > > > >
> > > > > > > There is no implicit relation on the Pacemaker side.
> > > > > > > However, if the agent returns "failed" for both resources
> > > > > > > when either one fails, you could see something like that.
> > > > > > > I'd look at the logs on the DC and see why it decided to
> > > > > > > restart the second resource.
> > > > > > > --
> > > > > > > Ken Gaillot <kgaillot at redhat.com>
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > Users mailing list: Users at clusterlabs.org
> > > > > > > http://lists.clusterlabs.org/mailman/listinfo/users
> > > > > > >
> > > > > > > Project Home: http://www.clusterlabs.org
> > > > > > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > > > > > Bugs: http://bugs.clusterlabs.org
> > > > > > >
> > > > > >
> > > > > --
> > > > > Ken Gaillot <kgaillot at redhat.com>
> > > > >
> > > > >
> > > >
> > > >
> > >
> > --
> > Ken Gaillot <kgaillot at redhat.com>
> > 
> > 
> 
-- 
Ken Gaillot <kgaillot at redhat.com>



