[ClusterLabs] Antw: Re: Why Do All The Services Go Down When Just One Fails?
Eric Robinson
eric.robinson at psmnv.com
Wed Feb 20 13:51:18 EST 2019
> -----Original Message-----
> From: Users <users-bounces at clusterlabs.org> On Behalf Of Ulrich Windl
> Sent: Tuesday, February 19, 2019 11:35 PM
> To: users at clusterlabs.org
> Subject: [ClusterLabs] Antw: Re: Why Do All The Services Go Down When
> Just One Fails?
>
> >>> Eric Robinson <eric.robinson at psmnv.com> wrote on 19.02.2019 at 21:06 in
> >>> message
> <MN2PR03MB4845BE22FADA30B472174B79FA7C0 at MN2PR03MB4845.namprd03.prod.outlook.com>:
>
> >> -----Original Message-----
> >> From: Users <users-bounces at clusterlabs.org> On Behalf Of Ken Gaillot
> >> Sent: Tuesday, February 19, 2019 10:31 AM
> >> To: Cluster Labs - All topics related to open-source clustering
> >> welcomed <users at clusterlabs.org>
> >> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just
> >> One Fails?
> >>
> >> On Tue, 2019-02-19 at 17:40 +0000, Eric Robinson wrote:
> >> > > -----Original Message-----
> >> > > From: Users <users-bounces at clusterlabs.org> On Behalf Of Andrei
> >> > > Borzenkov
> >> > > Sent: Sunday, February 17, 2019 11:56 AM
> >> > > To: users at clusterlabs.org
> >> > > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When
> >> > > Just One Fails?
> >> > >
> >> > > On 17.02.2019 0:44, Eric Robinson wrote:
> >> > > > Thanks for the feedback, Andrei.
> >> > > >
> >> > > > I only want cluster failover to occur if the filesystem or drbd
> >> > > > resources fail,
> >> > >
> >> > > > or if the cluster messaging layer detects a complete node failure.
> >> > > > Is there a way to tell Pacemaker not to trigger a cluster failover
> >> > > > if any of the p_mysql resources fail?
> >> > > >
> >> > >
> >> > > Let's look at this differently. If all these applications depend
> >> > > on each other, you should not be able to stop individual resources
> >> > > in the first place - you need to group them or define dependencies
> >> > > so that stopping any resource stops everything.
> >> > >
> >> > > If these applications are independent, they should not share
> >> > > resources. Each MySQL application should have its own IP, its own
> >> > > FS, and its own block device for that FS, so that they can be
> >> > > moved between cluster nodes independently.
> >> > >
> >> > > Anything else will lead to trouble, as you have already observed.
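If I understand Andrei's suggestion, each MySQL instance would get its own complete stack, grouped so it can move on its own. A rough sketch of one instance in pcs syntax (the names, mount point, IP, and config path here are invented, and the per-instance DRBD device plus its promote/colocation constraints are omitted):

    pcs resource create p_fs_mysql001 ocf:heartbeat:Filesystem \
        device=/dev/drbd1 directory=/mnt/mysql001 fstype=ext4
    pcs resource create p_ip_mysql001 ocf:heartbeat:IPaddr2 \
        ip=10.0.0.101 cidr_netmask=24
    pcs resource create p_mysql001 ocf:heartbeat:mysql \
        config=/etc/mysql001/my.cnf
    pcs resource group add g_mysql001 p_fs_mysql001 p_ip_mysql001 p_mysql001

With many instances per node, that means just as many filesystems and IPs, which is the part that is not practical for us.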
> >> >
> >> > FYI, the MySQL services do not depend on each other. All of them
> >> > depend on the floating IP, which depends on the filesystem, which
> >> > depends on DRBD, but they do not depend on each other. Ideally, the
> >> > failure of p_mysql_002 should not cause failure of other mysql
> >> > resources, but now I understand why it happened. Pacemaker wanted
> >> > to start it on the other node, so it needed to move the floating
> >> > IP, filesystem, and DRBD primary, which had the cascade effect of
> >> > stopping the other MySQL resources.
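If I am reading our setup correctly, the constraints behind that chain amount to something like the following (pcs syntax; p_fs_clust01 is a placeholder name for our filesystem resource, and the DRBD promote/colocation constraints below the filesystem are left out):

    pcs constraint order start p_fs_clust01 then start p_vip_clust01
    pcs constraint colocation add p_vip_clust01 with p_fs_clust01 INFINITY
    pcs constraint order start p_vip_clust01 then start p_mysql_001
    pcs constraint colocation add p_mysql_001 with p_vip_clust01 INFINITY
    # ...plus the same order/colocation pair for each other p_mysql_* resource

As I now understand it, the mandatory colocation of each p_mysql_* resource with the floating IP is what lets a single instance's recovery pull the IP, and everything under it, to the other node.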
> >> >
> >> > I think I also understand why the p_vip_clust01 resource blocked.
> >> >
> >> > FWIW, we've been using Linux HA since 2006, originally Heartbeat,
> >> > but then Corosync+Pacemaker. The past 12 years have been relatively
> >> > problem free. This symptom is new for us, only within the past year.
> >> > Our cluster nodes have many separate instances of MySQL running, so
> >> > it is not practical to have that many filesystems, IPs, etc. We are
> >> > content with the way things are, except for this new troubling
> >> > behavior.
> >> >
> >> > If I understand the thread correctly, on-fail=stop will not work
> >> > because the cluster will still try to stop the resources that are
> >> > implied dependencies.
> >> >
> >> > Bottom line is, how do we configure the cluster in such a way that
> >> > there are no cascading circumstances when a MySQL resource fails?
> >> > Basically, if a MySQL resource fails, it fails. We'll deal with
> >> > that on an ad-hoc basis. I don't want the whole cluster to barf.
> >> > What about on-fail=ignore? Earlier, you suggested symmetrical=false
> >> > might also do the trick, but you said it comes with its own can of worms.
> >> > What are the downsides of on-fail=ignore or symmetrical=false?
> >> >
> >> > --Eric
> >>
> >> Even adding on-fail=ignore to the recurring monitors may not do what
> >> you want, because I suspect that even an ignored failure will make
> >> the node less preferable for all the other resources. But it's worth
> >> testing.
> >>
> >> Otherwise, your best option is to remove all the recurring monitors
> >> from the mysql resources, and rely on external monitoring (e.g.
> >> nagios, icinga, monit, ...) to detect problems.
> >
> > This is probably a dumb question, but can we remove just the monitor
> > operation but leave the resource configured in the cluster? If a node
> > fails over, we do want the resources to start automatically on the new
> > primary node.
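If that approach is workable, I assume it would look something like one of these (pcs syntax; the interval is just an example):

    # Drop the recurring monitor entirely and rely on external monitoring:
    pcs resource op remove p_mysql_001 monitor
    # Or keep the monitor but tell Pacemaker not to react to its failures:
    pcs resource update p_mysql_001 op monitor interval=60s on-fail=ignore

Either way the resource stays defined, so, if I understand correctly, it would still be started automatically on the new primary node after a failover of the underlying IP/filesystem/DRBD stack.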
>
> Actually I wonder whether this makes sense at all: IMHO, a cluster ensures
> that the phone does not ring at night to make me perform recovery
> operations after a failure. Once you move to starting and stopping
> resources manually, I fail to see the reason for a cluster.
>
> When done well, independent resources should be configured (and
> managed) independently; otherwise they are dependent. There is no
> "middle-way".
>
> Regards,
> Ulrich
The diagram below should display correctly in a fixed-width font like Consolas. The setup it shows is supposed to be possible, and is even referenced in the ClusterLabs documentation.
+--------------+
| mysql001     +--+
+--------------+  |
+--------------+  |
| mysql002     +--+
+--------------+  |
+--------------+  |   +-------------+   +------------+   +----------+
| mysql003     +--+-->+ floating ip +-->+ filesystem +-->+ blockdev |
+--------------+  |   +-------------+   +------------+   +----------+
+--------------+  |
| mysql004     +--+
+--------------+  |
+--------------+  |
| mysql005     +--+
+--------------+
In the layout above, the MySQL instances depend on the same underlying service stack, but they do not depend on each other. Therefore, as I understand it, the failure of one MySQL instance should not cause the failure of the other MySQL instances if on-fail=ignore or on-fail=stop is set. At least, that’s the way it seems to me, but based on the thread, I guess it does not behave that way.
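Concretely, the change I have in mind is something like this on each instance (pcs syntax; the interval shown is just a placeholder for whatever we use today):

    pcs resource update p_mysql_002 op monitor interval=30s on-fail=ignore
    # ...repeated for each of the other p_mysql_* resources

Per Ken's comment above, even an ignored monitor failure might still make the node less preferable for the other resources, so we would obviously test this before relying on it.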
>
> >
> >> --
> >> Ken Gaillot <kgaillot at redhat.com>
> >>