[ClusterLabs] When resource fails to start it stops an apparently unrelated resource

Gerard Garcia gerard at talaia.io
Tue Oct 17 16:27:50 UTC 2017


That makes sense. I've tried copying the anything resource and changing its
name and id (which I guess should be enough to make Pacemaker treat them as
different resources), but I still have the same problem.

After more debugging I have reduced the problem to this:
* First cloned resource running fine
* Second cloned resource running fine
* Manually set the failcount of the second cloned resource to INFINITY
(example commands below)
* Pacemaker triggers a stop operation (without any monitor operation
failing) for both resources on the node where the failcount has been set to
INFINITY.
* Resetting the failcount starts both resources again
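
For reference, the failcount can be set and cleared by hand with something
along these lines (resource and node names are placeholders, and the exact
attribute name depends on the Pacemaker version, since 1.1.17+ tracks fail
counts per operation):

    # push the fail count of the second cloned resource to INFINITY on node1
    crm_attribute --type status --node node1 \
        --name fail-count-second-rsc --update INFINITY

    # reset it again (this also clears the resource's failed operations)
    crm_resource --cleanup --resource second-rsc --node node1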

Weirdly enough, the second resource doesn't stop if I set the first
resource's failcount to INFINITY (not even the first resource stops...).

But:
* If I set the first resource's clone to globally-unique=true it does not
stop, so somehow this breaks the relation.
* If I manually set the failcount of the first resource to 0 that also
breaks the relation, so it does not stop either. It seems like the
failcount value is being inherited from the second resource when the first
one does not have a value of its own (see the commands below).
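
To illustrate (again with placeholder names; first-rsc-clone is the clone
wrapping the first resource):

    # marking the first clone as globally unique breaks the coupling
    crm_resource --resource first-rsc-clone --meta \
        --set-parameter globally-unique --parameter-value true

    # query the per-node fail counts directly, to see whether the two
    # resources really have separate fail-count attributes
    crm_attribute --type status --node node1 \
        --name fail-count-first-rsc --query
    crm_attribute --type status --node node1 \
        --name fail-count-second-rsc --query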

I must have something configured wrongly, but I can't really see why this
relationship exists...

Gerard

On Tue, Oct 17, 2017 at 3:35 PM, Ken Gaillot <kgaillot at redhat.com> wrote:

> On Tue, 2017-10-17 at 11:47 +0200, Gerard Garcia wrote:
> > Thanks Ken. Yes, inspecting the logs, it seems that the failcount of
> > the correctly running resource reaches the maximum number of allowed
> > failures and it gets banned on all nodes.
> >
> > What is weird is that I only see the failcount for the first resource
> > getting updated; it is as if the failcounts were being mixed. In fact,
> > when the two resources get banned, the only way I have to make the
> > first one start is to disable the failing one and clean the failcount
> > of both resources (it is not enough to only clean the failcount of the
> > first resource). Does that make sense?
> >
> > Gerard
>
> My suspicion is that you have two instances of the same service, and
> the resource agent monitor is only checking the general service, rather
> than a specific instance of it, so the monitors on both of them return
> failure if either one is failing.
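> 
> For example, a monitor check that keys on something shared by both
> resources (same pid file, same port, same service) will report the same
> state for both, so a failure of either one shows up on both. A
> per-instance check has to key on something unique to the resource, e.g.
> (illustration only, not literally what the "anything" agent does):
> 
>     # shared check: both resources look at the same pid file
>     kill -0 "$(cat /var/run/my_service.pid)"
> 
>     # per-instance check: each resource has its own pid file, keyed on
>     # the OCF_RESOURCE_INSTANCE variable Pacemaker sets per resource
>     kill -0 "$(cat /var/run/${OCF_RESOURCE_INSTANCE}.pid)"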
>
> That would explain why you have to disable the failing resource, so
> that its monitor stops running. I can't think of why you'd have to
> clean its failcount for the other one to start, though.
>
> The "anything" agent very often causes more problems than it solves ...
>  I'd recommend writing your own OCF agent tailored to your service.
> It's not much more complicated than an init script.
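> 
> Roughly, the skeleton looks like this (just a sketch; the my_service_*
> functions are placeholders for whatever starts, stops, and
> health-checks your particular service, and the meta-data is cut down
> to a minimum):
> 
> #!/bin/sh
> # Minimal custom OCF resource agent (sketch, not production-ready)
> : ${OCF_ROOT=/usr/lib/ocf}
> : ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
> . ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
> 
> meta_data() {
>     cat <<END
> <?xml version="1.0"?>
> <resource-agent name="myservice">
>   <version>0.1</version>
>   <shortdesc lang="en">Manages myservice</shortdesc>
>   <longdesc lang="en">Starts, stops and monitors myservice.</longdesc>
>   <parameters/>
>   <actions>
>     <action name="start" timeout="20s"/>
>     <action name="stop" timeout="20s"/>
>     <action name="monitor" timeout="20s" interval="10s"/>
>     <action name="meta-data" timeout="5s"/>
>   </actions>
> </resource-agent>
> END
> }
> 
> # my_service_* below are placeholders: define them for your service
> case "$1" in
>     meta-data) meta_data; exit $OCF_SUCCESS ;;
>     start)     my_service_start;  exit $? ;;
>     stop)      my_service_stop;   exit $? ;;
>     monitor)   my_service_is_healthy && exit $OCF_SUCCESS
>                exit $OCF_NOT_RUNNING ;;
>     *)         exit $OCF_ERR_UNIMPLEMENTED ;;
> esac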
>
> > On Mon, Oct 16, 2017 at 6:57 PM, Ken Gaillot <kgaillot at redhat.com>
> > wrote:
> > > On Mon, 2017-10-16 at 18:30 +0200, Gerard Garcia wrote:
> > > > Hi,
> > > >
> > > > I have a cluster with two ocf:heartbeat:anything resources, each
> > > > running as a clone on all nodes of the cluster. For some reason,
> > > > when one of them fails to start, the other one stops. There is no
> > > > constraint configured or any kind of relation between them.
> > > >
> > > > Is it possible that there is some kind of implicit relation that
> > > > I'm not aware of (for example, because they are of the same
> > > > type)?
> > > >
> > > > Thanks,
> > > >
> > > > Gerard
> > >
> > > There is no implicit relation on the Pacemaker side. However, if
> > > the agent returns "failed" for both resources when either one
> > > fails, you could see something like that. I'd look at the logs on
> > > the DC and see why it decided to restart the second resource.
> > > --
> > > Ken Gaillot <kgaillot at redhat.com>
> > >
> >
> --
> Ken Gaillot <kgaillot at redhat.com>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>