[ClusterLabs] When resource fails to start it stops an apparently unrelated resource
Gerard Garcia
gerard at talaia.io
Thu Oct 19 10:03:03 CEST 2017
I'm so lucky :) thanks for your help!
Gerard
On Thu, Oct 19, 2017 at 12:04 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
> On Wed, 2017-10-18 at 16:58 +0200, Gerard Garcia wrote:
> > I'm using version 1.1.15-11.el7_3.2-e174ec8. As far as I know, it is the
> > latest stable version in CentOS 7.3.
> >
> > Gerard
>
> Interesting ... this was an undetected bug that was coincidentally
> fixed by the recent fail-count work released in 1.1.17. The bug only
> affected cloned resources where one clone's name ended with the
> other's.
>
> FYI, CentOS 7.4 has 1.1.16, but that won't help this issue.
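
To check an existing configuration for the naming pattern described above (one clone's name ending with another's), a small script along these lines might help. This is only a sketch: the function name suffix_collisions and the sample names are made up for illustration, and it says nothing about how Pacemaker matches names internally.

```python
from itertools import permutations


def suffix_collisions(names):
    """Return (shorter, longer) pairs where `longer` ends with `shorter`.

    `names` is any iterable of resource ids; duplicates are ignored.
    """
    unique = sorted(set(names))
    return [(a, b) for a, b in permutations(unique, 2) if b.endswith(a)]


if __name__ == "__main__":
    # Hypothetical resource ids, taken from the names discussed in this thread.
    for shorter, longer in suffix_collisions(["forwarder", "bgpforwarder"]):
        print(f"'{longer}' ends with '{shorter}': consider renaming one of them")
```

On versions before 1.1.17, renaming one resource of each reported pair (for example bgpforwarder to forwarderbgp) should avoid the pattern.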
>
> >
> > On Wed, Oct 18, 2017 at 4:42 PM, Ken Gaillot <kgaillot at redhat.com>
> > wrote:
> > > On Wed, 2017-10-18 at 14:25 +0200, Gerard Garcia wrote:
> > > > So I think I found the problem. The two resources are named forwarder
> > > > and bgpforwarder. It doesn't matter whether bgpforwarder exists: as
> > > > soon as I set the failcount to INFINITY for a resource named
> > > > bgpforwarder (crm_failcount -r bgpforwarder -v INFINITY), it directly
> > > > affects the forwarder resource.
> > > >
> > > > If I change the name to forwarderbgp, the problem disappears. So it
> > > > seems that Pacemaker mixes up the bgpforwarder and forwarder names.
> > > > Is it a bug?
> > > >
> > > > Gerard
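
One way to picture the behaviour reported above is a fail-count lookup that matches attribute names by suffix instead of exactly. The sketch below is a toy model, not Pacemaker's code: the fail-count-<name> attribute naming, both lookup functions, and the suffix-matching rule are assumptions chosen only to reproduce the observed symptom.

```python
INFINITY = 1_000_000  # Pacemaker represents INFINITY as a score of 1000000
PREFIX = "fail-count-"  # assumed attribute naming, for illustration only


def suffix_fail_count(attrs, resource):
    """Hypothetical buggy lookup: sums every fail-count attribute whose
    name merely ENDS WITH the resource name, capped at INFINITY."""
    total = sum(
        value
        for name, value in attrs.items()
        if name.startswith(PREFIX) and name.endswith(resource)
    )
    return min(total, INFINITY)


def exact_fail_count(attrs, resource):
    """Lookup that only accepts the exact attribute name."""
    return attrs.get(PREFIX + resource, 0)


if __name__ == "__main__":
    # Only bgpforwarder has a fail count, as in the crm_failcount test above.
    node_attrs = {PREFIX + "bgpforwarder": INFINITY}
    print("suffix lookup, forwarder:", suffix_fail_count(node_attrs, "forwarder"))
    print("exact lookup, forwarder:", exact_fail_count(node_attrs, "forwarder"))
```

In this model the suffix lookup also explains why renaming to forwarderbgp helps, since that name no longer ends with forwarder, and why setting the fail count on forwarder leaves bgpforwarder alone.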
> > >
> > > That's really surprising. What version of pacemaker are you using?
> > > There were a lot of changes in fail count handling in the last few
> > > releases.
> > >
> > > >
> > > > On Tue, Oct 17, 2017 at 6:27 PM, Gerard Garcia <gerard at talaia.io>
> > > > wrote:
> > > > > That makes sense. I've tried copying the anything resource and
> > > > > changing its name and id (which I guess should be enough to make
> > > > > pacemaker think they are different) but I still have the same
> > > > > problem.
> > > > >
> > > > > After more debugging I have reduced the problem to this:
> > > > > * First cloned resource running fine
> > > > > * Second cloned resource running fine
> > > > > * Manually set the failcount of the second cloned resource to INFINITY
> > > > > * Pacemaker triggers a stop operation (without any monitor operation
> > > > > failing) for the two resources on the node where the failcount has
> > > > > been set to INFINITY.
> > > > > * Resetting the failcount starts the two resources again
> > > > >
> > > > > Weirdly enough, the second resource doesn't stop if I set the first
> > > > > resource's failcount to INFINITY (not even the first resource
> > > > > stops...).
> > > > >
> > > > > But:
> > > > > * If I set the first resource as globally-unique=true it does not
> > > > > stop, so somehow this breaks the relation.
> > > > > * If I manually set the failcount to 0 in the first resource, that
> > > > > also breaks the relation, so it does not stop either. It seems like
> > > > > the failcount value is being inherited from the second resource
> > > > > when it does not have any value.
> > > > >
> > > > > I must have something wrongly configured, but I can't really see
> > > > > why there is this relationship...
> > > > >
> > > > > Gerard
> > > > >
> > > > > On Tue, Oct 17, 2017 at 3:35 PM, Ken Gaillot <kgaillot at redhat.com>
> > > > > wrote:
> > > > > > On Tue, 2017-10-17 at 11:47 +0200, Gerard Garcia wrote:
> > > > > > > Thanks Ken. Yes, inspecting the logs, it seems that the failcount
> > > > > > > of the correctly running resource reaches the maximum number of
> > > > > > > allowed failures and it gets banned on all nodes.
> > > > > > >
> > > > > > > What is weird is that I only see the failcount for the first
> > > > > > > resource getting updated; it is as if the failcounts are being
> > > > > > > mixed. In fact, when the two resources get banned, the only way
> > > > > > > I have to make the first one start is to disable the failing one
> > > > > > > and clean the failcount of the two resources (it is not enough to
> > > > > > > clean only the failcount of the first resource). Does it make
> > > > > > > sense?
> > > > > > >
> > > > > > > Gerard
> > > > > >
> > > > > > My suspicion is that you have two instances of the same service,
> > > > > > and the resource agent monitor is only checking the general
> > > > > > service, rather than a specific instance of it, so the monitors on
> > > > > > both of them return failure if either one is failing.
> > > > > >
> > > > > > That would explain why you have to disable the failing resource,
> > > > > > so its monitor stops running. I can't think of why you'd have to
> > > > > > clean its failcount for the other one to start, though.
> > > > > >
> > > > > > The "anything" agent very often causes more problems than it
> > > > > > solves ... I'd recommend writing your own OCF agent tailored to
> > > > > > your service. It's not much more complicated than an init script.
> > > > > >
> > > > > > > On Mon, Oct 16, 2017 at 6:57 PM, Ken Gaillot <kgaillot at redhat.com>
> > > > > > > wrote:
> > > > > > > > On Mon, 2017-10-16 at 18:30 +0200, Gerard Garcia wrote:
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > I have a cluster with two ocf:heartbeat:anything resources,
> > > > > > > > > each one running as a clone on all nodes of the cluster. For
> > > > > > > > > some reason, when one of them fails to start, the other one
> > > > > > > > > stops. There is no constraint configured or any kind of
> > > > > > > > > relation between them.
> > > > > > > > >
> > > > > > > > > Is it possible that there is some kind of implicit relation
> > > > > > > > > that I'm not aware of (for example, because they are the same
> > > > > > > > > type)?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Gerard
> > > > > > > >
> > > > > > > > There is no implicit relation on the Pacemaker side. However,
> > > > > > > > if the agent returns "failed" for both resources when either
> > > > > > > > one fails, you could see something like that. I'd look at the
> > > > > > > > logs on the DC and see why it decided to restart the second
> > > > > > > > resource.
> > > > > > > > --
> > > > > > > > Ken Gaillot <kgaillot at redhat.com>
> > > > > > > >
> > > > > > > > _______________________________________________
> > > > > > > > Users mailing list: Users at clusterlabs.org
> > > > > > > > http://lists.clusterlabs.org/mailman/listinfo/users
> > > > > > > >
> > > > > > > > Project Home: http://www.clusterlabs.org
> > > > > > > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > > > > > > Bugs: http://bugs.clusterlabs.org
> > > > > > > >
> > > > > > --
> > > > > > Ken Gaillot <kgaillot at redhat.com>
> > > > > >
> > > --
> > > Ken Gaillot <kgaillot at redhat.com>
> > >
> --
> Ken Gaillot <kgaillot at redhat.com>