[ClusterLabs] Why Do All The Services Go Down When Just One Fails?

Eric Robinson eric.robinson at psmnv.com
Tue Feb 19 15:06:53 EST 2019


> -----Original Message-----
> From: Users <users-bounces at clusterlabs.org> On Behalf Of Ken Gaillot
> Sent: Tuesday, February 19, 2019 10:31 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> <users at clusterlabs.org>
> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> Fails?
> 
> On Tue, 2019-02-19 at 17:40 +0000, Eric Robinson wrote:
> > > -----Original Message-----
> > > From: Users <users-bounces at clusterlabs.org> On Behalf Of Andrei
> > > Borzenkov
> > > Sent: Sunday, February 17, 2019 11:56 AM
> > > To: users at clusterlabs.org
> > > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just
> > > One Fails?
> > >
> > > On 17.02.2019 0:44, Eric Robinson wrote:
> > > > Thanks for the feedback, Andrei.
> > > >
> > > > I only want cluster failover to occur if the filesystem or drbd
> > > > resources fail, or if the cluster messaging layer detects a
> > > > complete node failure. Is there a way to tell Pacemaker not to
> > > > trigger a cluster failover if any of the p_mysql resources fail?
> > > >
> > >
> > > Let's look at this differently. If all these applications depend on
> > > each other, you should not be able to stop an individual resource in
> > > the first place - you need to group them or define dependencies so
> > > that stopping any one resource stops everything.
> > >
> > > If these applications are independent, they should not share
> > > resources. Each MySQL application should have its own IP, its own FS,
> > > and its own block device for that FS, so that they can be moved
> > > between cluster nodes independently.
> > >
> > > Anything else will lead to trouble, as you have already observed.
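
For what it's worth, a fully independent per-instance stack like Andrei
describes would look roughly like the following with pcs (the resource
and group names here are made up for illustration; the crm shell
equivalent is similar):

    # one DRBD device, filesystem, VIP, and mysqld per instance
    pcs resource group add g_mysql_002 p_fs_002 p_vip_002 p_mysql_002
    pcs constraint colocation add g_mysql_002 with ms_drbd_002 \
        INFINITY with-rsc-role=Master
    pcs constraint order promote ms_drbd_002 then start g_mysql_002

With that layout, a failure in one group can move that one group to the
other node without touching any of the other instances.
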
> >
> > FYI, the MySQL services do not depend on each other. All of them
> > depend on the floating IP, which depends on the filesystem, which
> > depends on DRBD, but they do not depend on each other. Ideally, the
> > failure of p_mysql_002 should not cause failure of other mysql
> > resources, but now I understand why it happened. Pacemaker wanted to
> > start it on the other node, so it needed to move the floating IP,
> > filesystem, and DRBD primary, which had the cascade effect of stopping
> > the other MySQL resources.
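
(To make the cascade concrete, the shared stack is tied together with
constraints along these lines, simplified and illustrative rather than
our exact config:

    # every instance depends on the one shared floating IP
    pcs constraint colocation add p_mysql_002 with p_vip_clust01 INFINITY
    pcs constraint order p_vip_clust01 then p_mysql_002
    # ... and p_vip_clust01 in turn sits on the shared FS / DRBD master

With mandatory colocations like that, recovering one failed p_mysql_*
resource on the other node drags the VIP along, and with it everything
ordered and colocated on the VIP.)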
> >
> > I think I also understand why the p_vip_clust01 resource blocked.
> >
> > FWIW, we've been using Linux HA since 2006, originally Heartbeat and
> > then Corosync+Pacemaker, and the past 12 years have been relatively
> > problem-free. This symptom is new for us, appearing only within the
> > past year. Our cluster nodes run many separate instances of MySQL, so
> > it is not practical to have that many filesystems, IPs, etc. We are
> > content with the way things are, except for this new troubling
> > behavior.
> >
> > If I understand the thread correctly, on-fail=stop will not work
> > because the cluster will still try to stop the resources that are
> > implied dependencies.
> >
> > Bottom line is, how do we configure the cluster in such a way that
> > there are no cascading effects when a MySQL resource fails?
> > Basically, if a MySQL resource fails, it fails. We'll deal with that
> > on an ad-hoc basis. I don't want the whole cluster to barf. What about
> > on-fail=ignore? Earlier, you suggested symmetrical=false might also do
> > the trick, but you said it comes with its own can of worms.
> > What are the downsides of on-fail=ignore or symmetrical=false?
> >
> > --Eric
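
As a side note on the symmetrical=false idea: if I understood that
suggestion correctly, it would mean making the IP-before-mysql ordering
one-way, for example (pcs syntax, assuming such a constraint exists per
instance):

    pcs constraint order start p_vip_clust01 then start p_mysql_002 \
        symmetrical=false

so that the reverse (stop) ordering is no longer implied when the IP is
stopped or moved. The colocation constraints would still apply, which is
presumably part of the can of worms.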
> 
> Even adding on-fail=ignore to the recurring monitors may not do what you
> want, because I suspect that even an ignored failure will make the node less
> preferable for all the other resources. But it's worth testing.
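
(If we do test on-fail=ignore, I assume the way to set it per resource
is something like the following with pcs, where the interval is a
placeholder that would have to match our existing monitor op:

    pcs resource update p_mysql_002 op monitor interval=60s on-fail=ignore

repeated for each p_mysql_* resource.)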
> 
> Otherwise, your best option is to remove all the recurring monitors from the
> mysql resources, and rely on external monitoring (e.g. nagios, icinga, monit,
> ...) to detect problems.

This is probably a dumb question, but can we remove just the monitor operation but leave the resource configured in the cluster? If a node fails over, we do want the resources to start automatically on the new primary node.
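
If so, I assume it would just be a matter of something like:

    pcs resource op remove p_mysql_002 monitor

for each instance, leaving the resource definition and its start/stop
operations in place so the cluster still brings everything up after a
failover. Please correct me if that's the wrong approach.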

> --
> Ken Gaillot <kgaillot at redhat.com>
> 

