[ClusterLabs] Why Do All The Services Go Down When Just One Fails?

Tue Feb 19 12:40:05 EST 2019

> -----Original Message-----
> From: Users <users-bounces at clusterlabs.org> On Behalf Of Andrei
> Borzenkov
> Sent: Sunday, February 17, 2019 11:56 AM
> To: users at clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> Fails?
> 
> 17.02.2019 0:44, Eric Robinson wrote:
> > Thanks for the feedback, Andrei.
> >
> > I only want cluster failover to occur if the filesystem or DRBD resources fail,
> > or if the cluster messaging layer detects a complete node failure. Is there a
> > way to tell Pacemaker not to trigger a cluster failover if any of the p_mysql
> > resources fail?
> >
> 
> Let's look at this differently. If all these applications depend on each other,
> you should not be able to stop an individual resource in the first place. You
> need to group them or define dependencies so that stopping any resource
> stops everything.
> 
> If these applications are independent, they should not share resources.
> Each MySQL application should have its own IP, its own FS, and its own block
> device for that FS so that they can be moved between cluster nodes
> independently.
> 
> Anything else will lead to trouble, as you have already observed.

FYI, the MySQL services do not depend on each other. All of them depend on the floating IP, which depends on the filesystem, which depends on DRBD, but they do not depend on each other. Ideally, the failure of p_mysql_002 should not cause the other MySQL resources to fail, but now I understand why it happened. Pacemaker wanted to start p_mysql_002 on the other node, so it needed to move the floating IP, the filesystem, and the DRBD primary, which had the cascade effect of stopping the other MySQL resources.
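
To make the cascade concrete, here is a toy Python model of the dependency chain described above. It is only a sketch of the reasoning, not Pacemaker's actual scheduler. The names p_drbd0, p_fs_clust01, p_mysql_001, and p_mysql_003 are hypothetical placeholders; only p_mysql_002 and p_vip_clust01 appear in this thread.

```python
# Toy model of the dependency chain; NOT Pacemaker's real scheduling logic.
# p_drbd0, p_fs_clust01, p_mysql_001 and p_mysql_003 are hypothetical names.
DEPENDS_ON = {
    "p_fs_clust01": "p_drbd0",        # filesystem sits on the DRBD primary
    "p_vip_clust01": "p_fs_clust01",  # floating IP sits on the filesystem
    "p_mysql_001": "p_vip_clust01",   # every MySQL instance sits on the IP
    "p_mysql_002": "p_vip_clust01",
    "p_mysql_003": "p_vip_clust01",
}


def chain_below(resource):
    """Return everything `resource` depends on, nearest dependency first."""
    chain = []
    while resource in DEPENDS_ON:
        resource = DEPENDS_ON[resource]
        chain.append(resource)
    return chain


def stopped_by_relocating(failed):
    """Return the other resources that must stop if `failed` moves nodes.

    Recovering `failed` on the other node drags along everything beneath
    it, so any resource that sits on one of those moved resources has to
    stop as well.
    """
    moved = set(chain_below(failed)) | {failed}
    affected = set()
    for res in DEPENDS_ON:
        if res not in moved and moved & set(chain_below(res)):
            affected.add(res)
    return sorted(affected)


print(stopped_by_relocating("p_mysql_002"))
```

In this model, relocating p_mysql_002 moves the IP, filesystem, and DRBD primary with it, so the two sibling MySQL placeholders are reported as collateral stops even though nothing links them to p_mysql_002 directly.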

I think I also understand why the p_vip_clust01 resource was blocked.

FWIW, we've been using Linux HA since 2006, originally Heartbeat, but then Corosync+Pacemaker. The past 12 years have been relatively problem-free. This symptom is new for us, appearing only within the past year. Our cluster nodes have many separate instances of MySQL running, so it is not practical to have that many filesystems, IPs, etc. We are content with the way things are, except for this new troubling behavior.

If I understand the thread correctly, on-fail=stop will not work because the cluster will still try to stop the resources that are implied dependencies.

Bottom line is, how do we configure the cluster in such a way that there are no cascading effects when a MySQL resource fails? Basically, if a MySQL resource fails, it fails. We'll deal with that on an ad-hoc basis. I don't want the whole cluster to barf. What about on-fail=ignore? Earlier, you suggested symmetrical=false might also do the trick, but you said it comes with its own can of worms. What are the downsides of on-fail=ignore or symmetrical=false?

--Eric
