[ClusterLabs] When resource fails to start it stops an apparently unrelated resource

Gerard Garcia gerard at talaia.io
Wed Oct 18 12:25:07 UTC 2017


So I think I found the problem. The two resources are named forwarder and
bgpforwarder. It doesn't even matter whether bgpforwarder exists: simply
setting the failcount of a resource named bgpforwarder to INFINITY
(crm_failcount -r bgpforwarder -v INFINITY) directly affects the
forwarder resource.

If I change the name to forwarderbgp, the problem disappears. So it seems
that Pacemaker is mixing up the bgpforwarder and forwarder names. Is this
a bug?
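The symptom looks like an unanchored substring match on the attribute name.
A toy illustration with plain grep (this is not Pacemaker code, and the
fail-count-* attribute names are just my assumption of how the counts might
be stored):

```shell
# Two per-resource fail-count attributes, as they might appear in the
# cluster status (names are illustrative, not actual Pacemaker internals).
attrs='fail-count-forwarder=0
fail-count-bgpforwarder=INFINITY'

# An unanchored match for "forwarder" hits both resources...
echo "$attrs" | grep -c 'forwarder'            # prints 2

# ...while an anchored match isolates the intended one.
echo "$attrs" | grep '^fail-count-forwarder='  # prints fail-count-forwarder=0
```

If a lookup inside the cluster were similarly unanchored, touching
bgpforwarder's failcount could bleed into forwarder, which would match what
I'm seeing.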

Gerard

On Tue, Oct 17, 2017 at 6:27 PM, Gerard Garcia <gerard at talaia.io> wrote:

> That makes sense. I've tried copying the anything resource and changing its
> name and id (which I guess should be enough to make Pacemaker treat them as
> different resources), but I still have the same problem.
>
> After more debugging I have reduced the problem to this:
> * First cloned resource running fine.
> * Second cloned resource running fine.
> * Manually set the failcount of the second cloned resource to INFINITY.
> * Pacemaker triggers a stop operation (without any monitor operation
> failing) for both resources on the node where the failcount was set to
> INFINITY.
> * Resetting the failcount starts the two resources again.
>
> Weirdly enough, the second resource doesn't stop if I set the first
> resource's failcount to INFINITY (not even the first resource stops...).
>
> But:
> * If I set globally-unique=true on the first resource, it does not stop, so
> somehow that breaks the relation.
> * If I manually set the failcount of the first resource to 0, that also
> breaks the relation, so it does not stop either. It seems like the failcount
> value is inherited from the second resource when the first one has no value
> of its own.
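Roughly, this is the sequence I'm running (resource names are from my
configuration; the commands are run on the affected node, against a live
cluster):

```
# Both clones healthy, then poison only the second one's failcount.
crm_mon -1                                   # both resources started
crm_failcount -r bgpforwarder -v INFINITY    # set failcount on 2nd resource
crm_mon -1                                   # forwarder is stopped here too!
crm_failcount -r bgpforwarder -D             # delete the failcount again
crm_mon -1                                   # both resources recover
```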
>
> I must have misconfigured something, but I can't really see where this
> relationship comes from...
>
> Gerard
>
> On Tue, Oct 17, 2017 at 3:35 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
>
>> On Tue, 2017-10-17 at 11:47 +0200, Gerard Garcia wrote:
>> > Thanks Ken. Yes, inspecting the logs it seems that the failcount of the
>> > correctly running resource reaches the maximum number of allowed
>> > failures and it gets banned on all nodes.
>> >
>> > What is weird is that I only see the failcount of the first resource
>> > being updated; it is as if the failcounts were being mixed. In fact,
>> > when the two resources get banned, the only way I can make the first
>> > one start again is to disable the failing one and clean the failcount
>> > of both resources (it is not enough to clean only the first resource's
>> > failcount). Does that make sense?
>> >
>> > Gerard
>>
>> My suspicion is that you have two instances of the same service, and
>> the resource agent's monitor is only checking the general service, rather
>> than a specific instance of it, so the monitors on both resources return
>> failure if either one is failing.
>>
>> That would explain why you have to disable the failing resource, so
>> its monitor stops running. I can't think of why you'd have to clean its
>> failcount for the other one to start, though.
>>
>> The "anything" agent very often causes more problems than it solves ...
>>  I'd recommend writing your own OCF agent tailored to your service.
>> It's not much more complicated than an init script.
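A minimal sketch of such an agent might look like the following (the service
name "forwarder", the daemon path, and the pidfile default are placeholders;
a real OCF agent also needs meta-data and validate-all actions):

```shell
#!/bin/sh
# Minimal sketch of a tailored OCF agent for a hypothetical "forwarder"
# daemon. Not a complete agent: meta-data/validate-all are omitted.

PIDFILE="${OCF_RESKEY_pidfile:-/var/run/forwarder.pid}"

monitor() {
    # Check only this instance's own pidfile, so two similar resources
    # can never report each other's failures.
    if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        return 0    # OCF_SUCCESS
    fi
    return 7        # OCF_NOT_RUNNING
}

start() {
    monitor && return 0
    /usr/local/bin/forwarder &    # placeholder for the real daemon
    echo $! > "$PIDFILE"
    monitor
}

stop() {
    monitor || return 0           # already stopped counts as success
    kill "$(cat "$PIDFILE")"
    rm -f "$PIDFILE"
}

if [ $# -gt 0 ]; then
    case "$1" in
        start)   start ;;
        stop)    stop ;;
        monitor) monitor ;;
        *)       exit 3 ;;        # OCF_ERR_UNIMPLEMENTED
    esac
fi
```

Because each clone instance checks its own pidfile, a failure of one
resource cannot leak into the other's monitor result.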
>>
>> > On Mon, Oct 16, 2017 at 6:57 PM, Ken Gaillot <kgaillot at redhat.com>
>> > wrote:
>> > > On Mon, 2017-10-16 at 18:30 +0200, Gerard Garcia wrote:
>> > > > Hi,
>> > > >
>> > > > I have a cluster with two ocf:heartbeat:anything resources, each
>> > > > running as a clone on all nodes of the cluster. For some reason,
>> > > > when one of them fails to start, the other one stops. There is no
>> > > > constraint configured or any kind of relation between them.
>> > > >
>> > > > Is it possible that there is some kind of implicit relation that
>> > > > I'm not aware of (for example, because they are the same type)?
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Gerard
>> > >
>> > > There is no implicit relation on the Pacemaker side. However, if the
>> > > agent returns "failed" for both resources when either one fails, you
>> > > could see something like that. I'd look at the logs on the DC and see
>> > > why it decided to restart the second resource.
>> > > --
>> > > Ken Gaillot <kgaillot at redhat.com>
>> > >
>> > > _______________________________________________
>> > > Users mailing list: Users at clusterlabs.org
>> > > http://lists.clusterlabs.org/mailman/listinfo/users
>> > >
>> > > Project Home: http://www.clusterlabs.org
>> > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> > > Bugs: http://bugs.clusterlabs.org
>> > >
>> >
>> --
>> Ken Gaillot <kgaillot at redhat.com>
>>
>>
>
>

