[ClusterLabs] When resource fails to start it stops an apparently unrelated resource

Gerard Garcia gerard at talaia.io
Wed Oct 18 10:58:23 EDT 2017


I'm using version 1.1.15-11.el7_3.2-e174ec8. As far as I know, that's the
latest stable version in CentOS 7.3.

Gerard

On Wed, Oct 18, 2017 at 4:42 PM, Ken Gaillot <kgaillot at redhat.com> wrote:

> On Wed, 2017-10-18 at 14:25 +0200, Gerard Garcia wrote:
> > So I think I found the problem. The two resources are named forwarder
> > and bgpforwarder. It doesn't even matter whether bgpforwarder exists: as
> > soon as I set the failcount to INFINITY for a resource named
> > bgpforwarder (crm_failcount -r bgpforwarder -v INFINITY), it directly
> > affects the forwarder resource.
> >
> > If I change the name to forwarderbgp, the problem disappears. So it
> > seems that Pacemaker mixes up the bgpforwarder and forwarder names. Is
> > it a bug?
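
For reference, the sequence described above looks roughly like this (a
sketch using the resource names from the thread; node and output details
will of course vary):

    # Set the failcount of bgpforwarder to INFINITY on the local node
    crm_failcount -r bgpforwarder -v INFINITY

    # Show resource status including failcounts; with the behaviour
    # described above, forwarder unexpectedly stops on this node
    crm_mon -1 -f

    # Clearing the failure state lets the affected resource start again
    crm_resource --cleanup --resource bgpforwarder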
> >
> > Gerard
>
> That's really surprising. What version of pacemaker are you using?
> There were a lot of changes in fail count handling in the last few
> releases.
>
> >
> > On Tue, Oct 17, 2017 at 6:27 PM, Gerard Garcia <gerard at talaia.io>
> > wrote:
> > > That makes sense. I've tried copying the anything resource and
> > > changing its name and id (which I guess should be enough to make
> > > Pacemaker treat them as different), but I still have the same
> > > problem.
> > >
> > > After more debugging I have reduced the problem to this:
> > > * First cloned resource running fine
> > > * Second cloned resource running fine
> > > * Manually set the failcount to INFINITY for the second cloned resource
> > > * Pacemaker triggers a stop operation (without any monitor operation
> > > failing) for both resources on the node where the failcount was set
> > > to INFINITY
> > > * Resetting the failcount starts both resources again
> > >
> > > Weirdly enough, the second resource doesn't stop if I set the first
> > > resource's failcount to INFINITY (not even the first resource
> > > stops...).
> > >
> > > But:
> > > * If I set globally-unique=true on the first resource, it does not
> > > stop, so somehow this breaks the relation (sketched below).
> > > * If I manually set the failcount to 0 on the first resource, that
> > > also breaks the relation, so it does not stop either. It seems as if
> > > the failcount value is being inherited from the second resource
> > > when the first one does not have any value of its own.
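
The two observations above correspond to something like the following (a
sketch; "forwarder-clone" is a hypothetical clone id, and whether you use
pcs or the crm shell depends on the setup):

    # Make the clone's instances globally unique
    pcs resource meta forwarder-clone globally-unique=true

    # Or give the first resource an explicit failcount of 0 on this node
    crm_failcount -r forwarder -v 0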
> > >
> > > I must have something wrongly configured, but I can't really see
> > > why this relationship exists...
> > >
> > > Gerard
> > >
> > > On Tue, Oct 17, 2017 at 3:35 PM, Ken Gaillot <kgaillot at redhat.com>
> > > wrote:
> > > > On Tue, 2017-10-17 at 11:47 +0200, Gerard Garcia wrote:
> > > > > Thanks Ken. Yes, inspecting the logs it seems that the failcount
> > > > > of the correctly running resource reaches the maximum number of
> > > > > allowed failures and it gets banned on all nodes.
> > > > >
> > > > > What is weird is that I only see the failcount for the first
> > > > > resource getting updated; it is as if the failcounts are being
> > > > > mixed. In fact, when the two resources get banned, the only way I
> > > > > can make the first one start again is to disable the failing one
> > > > > and clean the failcount of both resources (it is not enough to
> > > > > only clean the failcount of the first resource). Does that make
> > > > > sense?
> > > > >
> > > > > Gerard
> > > >
> > > > My suspicion is that you have two instances of the same service, and
> > > > the resource agent's monitor is only checking the general service,
> > > > rather than a specific instance of it, so the monitors on both of
> > > > them return failure if either one is failing.
> > > >
> > > > That would explain why you have to disable the failing resource, so
> > > > its monitor stops running. I can't think of why you'd have to clean
> > > > its failcount for the other one to start, though.
> > > >
> > > > The "anything" agent very often causes more problems than it
> > > > solves ...
> > > >  I'd recommend writing your own OCF agent tailored to your
> > > > service.
> > > > It's not much more complicated than an init script.
> > > >
> > > > > On Mon, Oct 16, 2017 at 6:57 PM, Ken Gaillot <kgaillot at redhat.com>
> > > > > wrote:
> > > > > > On Mon, 2017-10-16 at 18:30 +0200, Gerard Garcia wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > I have a cluster with two ocf:heartbeat:anything resources,
> > > > > > > each one running as a clone on all nodes of the cluster. For
> > > > > > > some reason, when one of them fails to start, the other one
> > > > > > > stops. There is no constraint configured or any other kind of
> > > > > > > relation between them.
> > > > > > >
> > > > > > > Is it possible that there is some kind of implicit relation
> > > > > > > that I'm not aware of (for example, because they are of the
> > > > > > > same type)?
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Gerard
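
For context, cloned "anything" resources of the kind described above are
typically created with something along these lines (a sketch; the binfile
paths are assumptions, and the clone syntax shown is pcs 0.9 style, as
shipped with CentOS 7 — newer pcs uses a trailing "clone" keyword instead):

    pcs resource create forwarder ocf:heartbeat:anything \
        binfile=/usr/local/bin/forwarder \
        op monitor interval=10s --clone

    pcs resource create bgpforwarder ocf:heartbeat:anything \
        binfile=/usr/local/bin/bgpforwarder \
        op monitor interval=10s --clone

How well the agent's monitor can tell two such resources apart depends
entirely on the wrapped command and pidfile settings.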
> > > > > >
> > > > > > There is no implicit relation on the Pacemaker side. However, if
> > > > > > the agent returns "failed" for both resources when either one
> > > > > > fails, you could see something like that. I'd look at the logs on
> > > > > > the DC and see why it decided to restart the second resource.
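
One way to see that decision on the DC is something like this (a sketch;
the log file location depends on the distribution and logging setup):

    # Show current placement scores and pending actions from the live CIB
    crm_simulate -L -s

    # Search the cluster logs for the scheduler's (pengine) reasoning
    grep -e pengine -e crmd /var/log/messages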
> > > > > > --
> > > > > > Ken Gaillot <kgaillot at redhat.com>
> > > > > >
> > > > >
> > > > --
> > > > Ken Gaillot <kgaillot at redhat.com>
> > > >
> > >
> > >
> >
> --
> Ken Gaillot <kgaillot at redhat.com>
>