[ClusterLabs] Why Do All The Services Go Down When Just One Fails?

Ken Gaillot kgaillot at redhat.com
Tue Feb 19 13:31:23 EST 2019


On Tue, 2019-02-19 at 17:40 +0000, Eric Robinson wrote:
> > -----Original Message-----
> > From: Users <users-bounces at clusterlabs.org> On Behalf Of Andrei
> > Borzenkov
> > Sent: Sunday, February 17, 2019 11:56 AM
> > To: users at clusterlabs.org
> > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When
> > Just One
> > Fails?
> > 
> > 17.02.2019 0:44, Eric Robinson wrote:
> > > Thanks for the feedback, Andrei.
> > > 
> > > I only want cluster failover to occur if the filesystem or drbd
> > > resources fail, or if the cluster messaging layer detects a
> > > complete node failure. Is there a way to tell Pacemaker not to
> > > trigger a cluster failover if any of the p_mysql resources fail?
> > 
> > Let's look at this differently. If all these applications depend on
> > each other, you should not be able to stop an individual resource in
> > the first place - you need to group them or define dependencies so
> > that stopping any one resource stops everything.
> > 
> > If these applications are independent, they should not share
> > resources. Each MySQL application should have its own IP, its own
> > FS, and its own block device for that FS, so that they can be moved
> > between cluster nodes independently.
> > 
> > Anything else will lead to trouble, as you have already observed.
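A concrete sketch of that layout, assuming the pcs shell and the stock ocf:heartbeat agents; all resource names, devices, paths, and addresses below are placeholders:

```shell
# One self-contained group per MySQL instance: its own DRBD-backed
# filesystem, its own floating IP, its own mysqld. Each group can then
# fail over without touching the others.
pcs resource create mysql001_fs ocf:heartbeat:Filesystem \
    device=/dev/drbd1 directory=/mnt/mysql001 fstype=ext4
pcs resource create mysql001_vip ocf:heartbeat:IPaddr2 \
    ip=192.168.10.101 cidr_netmask=24
pcs resource create mysql001_db ocf:heartbeat:mysql \
    config=/mnt/mysql001/my.cnf
# Grouping implies colocation and ordering: fs, then vip, then mysqld.
pcs resource group add g_mysql001 mysql001_fs mysql001_vip mysql001_db
```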
> 
> FYI, the MySQL services do not depend on each other. All of them
> depend on the floating IP, which depends on the filesystem, which
> depends on DRBD, but they do not depend on each other. Ideally, the
> failure of p_mysql_002 should not cause failure of other mysql
> resources, but now I understand why it happened. Pacemaker wanted to
> start it on the other node, so it needed to move the floating IP,
> filesystem, and DRBD primary, which had the cascade effect of
> stopping the other MySQL resources.
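The cascade described above follows directly from a constraint chain along these lines (a hypothetical reconstruction; the p_* names echo those mentioned in the thread):

```shell
# Every MySQL resource is colocated with and ordered after the one VIP,
# which sits on the one filesystem, which sits on the DRBD primary.
pcs constraint order promote p_drbd0-master then start p_fs_clust01
pcs constraint colocation add p_fs_clust01 with master p_drbd0-master INFINITY
pcs constraint order p_fs_clust01 then p_vip_clust01
pcs constraint colocation add p_vip_clust01 with p_fs_clust01 INFINITY
pcs constraint order p_vip_clust01 then p_mysql_002
pcs constraint colocation add p_mysql_002 with p_vip_clust01 INFINITY
# Recovering p_mysql_002 on the peer node therefore drags the VIP, the
# filesystem, and the DRBD primary with it, restarting every other
# p_mysql_* that shares them.
```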
> 
> I think I also understand why the p_vip_clust01 resource blocked. 
> 
> FWIW, we've been using Linux HA since 2006, originally Heartbeat, but
> then Corosync+Pacemaker. The past 12 years have been relatively
> problem-free. This symptom is new for us, appearing only within the
> past year. Our cluster nodes run many separate instances of MySQL, so
> it is not practical to have that many filesystems, IPs, etc. We are
> content with the way things are, except for this new troubling
> behavior.
> 
> If I understand the thread correctly, on-fail=stop will not work
> because the cluster will still try to stop the resources that are
> implied dependencies.
> 
> Bottom line is, how do we configure the cluster in such a way that
> there are no cascading consequences when a MySQL resource fails?
> Basically, if a MySQL resource fails, it fails. We'll deal with that
> on an ad-hoc basis. I don't want the whole cluster to barf. What
> about on-fail=ignore? Earlier, you suggested symmetrical=false might
> also do the trick, but you said it comes with its own can of worms.
> What are the downsides of on-fail=ignore or symmetrical=false?
> 
> --Eric

Even adding on-fail=ignore to the recurring monitors may not do what
you want, because I suspect that even an ignored failure will make the
node less preferable for all the other resources. But it's worth
testing.
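If you do want to test it, a minimal way to set it (assuming pcs; the resource name and interval are placeholders) is:

```shell
# Mark the recurring monitor's failures as ignorable: Pacemaker logs
# the failure but takes no recovery action for this resource.
pcs resource update p_mysql_002 op monitor interval=30s on-fail=ignore
```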

Otherwise, your best option is to remove all the recurring monitors
from the mysql resources, and rely on external monitoring (e.g. nagios,
icinga, monit, ...) to detect problems.
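A sketch of that approach, again assuming pcs, with a placeholder resource name:

```shell
# Remove the recurring monitor so Pacemaker only starts/stops mysqld
# and never reacts to a runtime failure of it.
pcs resource op remove p_mysql_002 monitor
# An external checker then watches the daemon, e.g. a monit stanza:
#   check process mysql002 with pidfile /var/run/mysql002.pid
#     if does not exist then alert
```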
-- 
Ken Gaillot <kgaillot at redhat.com>



