[ClusterLabs] Antw: Re: Why Do All The Services Go Down When Just One Fails?
Eric Robinson
eric.robinson at psmnv.com
Wed Feb 20 13:51:18 EST 2019
> -----Original Message-----
> From: Users <users-bounces at clusterlabs.org> On Behalf Of Ulrich Windl
> Sent: Tuesday, February 19, 2019 11:35 PM
> To: users at clusterlabs.org
> Subject: [ClusterLabs] Antw: Re: Why Do All The Services Go Down When
> Just One Fails?
>
> >>> Eric Robinson <eric.robinson at psmnv.com> wrote on 19.02.2019 at 21:06 in
> >>> message
> <MN2PR03MB4845BE22FADA30B472174B79FA7C0 at MN2PR03MB4845.namprd03.prod.outlook.com>:
>
> >> -----Original Message-----
> >> From: Users <users-bounces at clusterlabs.org> On Behalf Of Ken Gaillot
> >> Sent: Tuesday, February 19, 2019 10:31 AM
> >> To: Cluster Labs - All topics related to open-source clustering
> >> welcomed <users at clusterlabs.org>
> >> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just
> >> One Fails?
> >>
> >> On Tue, 2019-02-19 at 17:40 +0000, Eric Robinson wrote:
> >> > > -----Original Message-----
> >> > > From: Users <users-bounces at clusterlabs.org> On Behalf Of Andrei
> >> > > Borzenkov
> >> > > Sent: Sunday, February 17, 2019 11:56 AM
> >> > > To: users at clusterlabs.org
> >> > > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When
> >> > > Just One Fails?
> >> > >
> >> > > On 17.02.2019 0:44, Eric Robinson wrote:
> >> > > > Thanks for the feedback, Andrei.
> >> > > >
> >> > > > I only want cluster failover to occur if the filesystem or drbd
> >> > > > resources fail,
> >> > >
> >> > > > or if the cluster messaging layer detects a complete node failure.
> >> > > > Is there a way to tell Pacemaker not to trigger a cluster failover
> >> > > > if any of the p_mysql resources fail?
> >> > > >
> >> > >
> >> > > Let's look at this differently. If all these applications depend
> >> > > on each other, you should not be able to stop individual resources
> >> > > in the first place - you need to group them or define dependencies
> >> > > so that stopping any resource stops everything.
> >> > >
> >> > > If these applications are independent, they should not share
> >> > > resources. Each MySQL application should have its own IP, its own
> >> > > FS, and its own block device for that FS, so that they can be
> >> > > moved between cluster nodes independently.
> >> > >
> >> > > Anything else will lead to trouble, as you have already observed.
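If I understand Andrei's suggestion, each MySQL instance would get its own complete stack, grouped so it can move on its own. A rough sketch of one instance in pcs syntax (the names, mount point, IP, and config path here are invented, and the per-instance DRBD device plus its promote/colocation constraints are omitted):

    pcs resource create p_fs_mysql001 ocf:heartbeat:Filesystem \
        device=/dev/drbd1 directory=/mnt/mysql001 fstype=ext4
    pcs resource create p_ip_mysql001 ocf:heartbeat:IPaddr2 \
        ip=10.0.0.101 cidr_netmask=24
    pcs resource create p_mysql001 ocf:heartbeat:mysql \
        config=/etc/mysql001/my.cnf
    pcs resource group add g_mysql001 p_fs_mysql001 p_ip_mysql001 p_mysql001

With many instances per node, that means just as many filesystems and IPs, which is the part that is not practical for us.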
> >> >
> >> > FYI, the MySQL services do not depend on each other. All of them
> >> > depend on the floating IP, which depends on the filesystem, which
> >> > depends on DRBD, but they do not depend on each other. Ideally, the
> >> > failure of p_mysql_002 should not cause failure of other mysql
> >> > resources, but now I understand why it happened. Pacemaker wanted
> >> > to start it on the other node, so it needed to move the floating
> >> > IP, filesystem, and DRBD primary, which had the cascade effect of
> >> > stopping the other MySQL resources.
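If I am reading our setup correctly, the constraints behind that chain amount to something like the following (pcs syntax; p_fs_clust01 is a placeholder name for our filesystem resource, and the DRBD promote/colocation constraints below the filesystem are left out):

    pcs constraint order start p_fs_clust01 then start p_vip_clust01
    pcs constraint colocation add p_vip_clust01 with p_fs_clust01 INFINITY
    pcs constraint order start p_vip_clust01 then start p_mysql_001
    pcs constraint colocation add p_mysql_001 with p_vip_clust01 INFINITY
    # ...plus the same order/colocation pair for each other p_mysql_* resource

As I now understand it, the mandatory colocation of each p_mysql_* resource with the floating IP is what lets a single instance's recovery pull the IP, and everything under it, to the other node.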
> >> >
> >> > I think I also understand why the p_vip_clust01 resource blocked.
> >> >
> >> > FWIW, we've been using Linux HA since 2006, originally Heartbeat,
> >> > but then Corosync+Pacemaker. The past 12 years have been relatively
> >> > problem free. This symptom is new for us, only within the past year.
> >> > Our cluster nodes have many separate instances of MySQL running, so
> >> > it is not practical to have that many filesystems, IPs, etc. We are
> >> > content with the way things are, except for this new troubling
> >> > behavior.
> >> >
> >> > If I understand the thread correctly, on-fail=stop will not work
> >> > because the cluster will still try to stop the resources that are
> >> > implied dependencies.
> >> >
> >> > Bottom line is, how do we configure the cluster in such a way that
> >> > there are no cascading circumstances when a MySQL resource fails?
> >> > Basically, if a MySQL resource fails, it fails. We'll deal with
> >> > that on an ad-hoc basis. I don't want the whole cluster to barf.
> >> > What about on-fail=ignore? Earlier, you suggested symmetrical=false
> >> > might also do the trick, but you said it comes with its own can of worms.
> >> > What are the downsides of on-fail=ignore or symmetrical=false?
> >> >
> >> > --Eric
> >>
> >> Even adding on-fail=ignore to the recurring monitors may not do what
> >> you want, because I suspect that even an ignored failure will make
> >> the node less preferable for all the other resources. But it's worth
> >> testing.
> >>
> >> Otherwise, your best option is to remove all the recurring monitors
> >> from the mysql resources, and rely on external monitoring (e.g.
> >> nagios, icinga, monit, ...) to detect problems.
> >
> > This is probably a dumb question, but can we remove just the monitor
> > operation but leave the resource configured in the cluster? If a node
> > fails over, we do want the resources to start automatically on the new
> > primary node.
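If that approach is workable, I assume it would look something like one of these (pcs syntax; the interval is just an example):

    # Drop the recurring monitor entirely and rely on external monitoring:
    pcs resource op remove p_mysql_001 monitor
    # Or keep the monitor but tell Pacemaker not to react to its failures:
    pcs resource update p_mysql_001 op monitor interval=60s on-fail=ignore

Either way the resource stays defined, so, if I understand correctly, it would still be started automatically on the new primary node after a failover of the underlying IP/filesystem/DRBD stack.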
>
> Actually I wonder whether this makes sense at all: IMHO, a cluster ensures
> that the phone does not ring at night to make me perform recovery
> operations after a failure. Once you move to starting and stopping
> resources manually, I fail to see the reason for a cluster.
>
> When done well, independent resources should be configured (and
> managed) independently; otherwise they are dependent. There is no
> "middle-way".
>
> Regards,
> Ulrich
The diagram below should display correctly in a fixed-width font like Consolas. The setup it shows is supposed to be possible, and is even referenced in the ClusterLabs documentation.
+--------------+
| mysql001     +--+
+--------------+  |
+--------------+  |
| mysql002     +--+
+--------------+  |
+--------------+  |   +-------------+   +------------+   +----------+
| mysql003     +--+-->+ floating ip +-->+ filesystem +-->+ blockdev |
+--------------+  |   +-------------+   +------------+   +----------+
+--------------+  |
| mysql004     +--+
+--------------+  |
+--------------+  |
| mysql005     +--+
+--------------+
In the layout above, the MySQL instances depend on the same underlying service stack, but they do not depend on each other. Therefore, as I understand it, the failure of one MySQL instance should not cause the failure of the other MySQL instances if on-fail=ignore or on-fail=stop is set. At least, that’s the way it seems to me, but based on the thread, I guess it does not behave that way.
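Concretely, the change I have in mind is something like this on each instance (pcs syntax; the interval shown is just a placeholder for whatever we use today):

    pcs resource update p_mysql_002 op monitor interval=30s on-fail=ignore
    # ...repeated for each of the other p_mysql_* resources

Per Ken's comment above, even an ignored monitor failure might still make the node less preferable for the other resources, so we would obviously test this before relying on it.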
>
> >
> >> --
> >> Ken Gaillot <kgaillot at redhat.com>
> >>