[ClusterLabs] When resource fails to start it stops an apparently unrelated resource

Ken Gaillot kgaillot at redhat.com
Tue Oct 17 09:35:43 EDT 2017


On Tue, 2017-10-17 at 11:47 +0200, Gerard Garcia wrote:
> Thanks Ken. Yes, inspecting the logs, it seems that the failcount of
> the correctly running resource reaches the maximum number of allowed
> failures, and the resource gets banned on all nodes.
> 
> What is weird is that I only see the failcount for the first
> resource being updated; it is as if the failcounts were being mixed.
> In fact, once the two resources get banned, the only way I can make
> the first one start is to disable the failing one and clean the
> failcount of both resources (it is not enough to clean only the
> failcount of the first resource). Does that make sense?
> 
> Gerard

My suspicion is that you have two instances of the same service, and
the resource agent's monitor is only checking the service in general,
rather than a specific instance of it, so the monitors of both
resources return failure if either one is failing.
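
To illustrate (a hypothetical monitor sketch, not the actual
"anything" agent code; "mydaemon" is a placeholder): if the check
matches the service by name instead of by a per-instance pid file,
every clone's monitor gets the same answer:

    # Hypothetical monitor: any running instance of "mydaemon"
    # satisfies the check, so the monitors of both resources always
    # agree -- including about failure. $OCF_SUCCESS (0) and
    # $OCF_NOT_RUNNING (7) come from ocf-shellfuncs.
    monitor() {
        if pgrep -f mydaemon >/dev/null 2>&1; then
            return $OCF_SUCCESS
        fi
        return $OCF_NOT_RUNNING
    }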

That would explain why you have to disable the failing resource, so
that its monitor stops running. I can't think of why you'd have to
clean its failcount for the other one to start, though.
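
You can watch the failcounts directly to see which resource is
actually accumulating failures, and clean them per resource (the
resource name "rsc1" below is a placeholder):

    # show per-resource fail counts on each node
    crm_mon -1 --failcounts

    # clear the failures and failcount for one resource
    crm_resource --cleanup --resource rsc1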

The "anything" agent very often causes more problems than it solves ...
 I'd recommend writing your own OCF agent tailored to your service.
It's not much more complicated than an init script.
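
As a starting point, a minimal skeleton looks something like this
(the daemon path, its --pidfile option, and the pid-file layout are
placeholders to adapt; a real agent must also print OCF XML metadata
for the meta-data action, omitted here for brevity):

    #!/bin/sh
    # Minimal OCF agent sketch -- illustrative only.
    : ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
    . ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

    DAEMON="/usr/local/bin/mydaemon"    # placeholder daemon
    PIDFILE="/var/run/mydaemon-${OCF_RESOURCE_INSTANCE}.pid"

    mydaemon_monitor() {
        # Check this instance's pid file, not the service in
        # general, so clones don't report each other's state.
        if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
            return $OCF_SUCCESS
        fi
        return $OCF_NOT_RUNNING
    }

    mydaemon_start() {
        mydaemon_monitor && return $OCF_SUCCESS
        "$DAEMON" --pidfile "$PIDFILE" || return $OCF_ERR_GENERIC
        return $OCF_SUCCESS
    }

    mydaemon_stop() {
        if mydaemon_monitor; then
            kill "$(cat "$PIDFILE")" && rm -f "$PIDFILE"
        fi
        return $OCF_SUCCESS
    }

    case "$1" in
        start)   mydaemon_start ;;
        stop)    mydaemon_stop ;;
        monitor) mydaemon_monitor ;;
        *)       exit $OCF_ERR_UNIMPLEMENTED ;;
    esac
    exit $?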

> On Mon, Oct 16, 2017 at 6:57 PM, Ken Gaillot <kgaillot at redhat.com>
> wrote:
> > On Mon, 2017-10-16 at 18:30 +0200, Gerard Garcia wrote:
> > > Hi,
> > >
> > > I have a cluster with two ocf:heartbeat:anything resources,
> > > each one running as a clone on all nodes of the cluster. For
> > > some reason, when one of them fails to start, the other one
> > > stops. There is no constraint configured or any other kind of
> > > relation between them.
> > >
> > > Is it possible that there is some kind of implicit relation
> > > that I'm not aware of (for example, because they are the same
> > > type)?
> > >
> > > Thanks,
> > >
> > > Gerard
> > 
> > There is no implicit relation on the Pacemaker side. However, if
> > the agent returns "failed" for both resources when either one
> > fails, you could see something like that. I'd look at the logs on
> > the DC and see why it decided to restart the second resource.
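> >
> > For example, something along these lines (the log path varies by
> > distribution; this one is just typical):
> >
> >     crm_mon -1 | grep -i "current dc"   # identify the DC node
> >     grep -iE "pengine|crmd" /var/log/cluster/corosync.log
> >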
> > --
> > Ken Gaillot <kgaillot at redhat.com>
-- 
Ken Gaillot <kgaillot at redhat.com>