[ClusterLabs] Re: Why Do All The Services Go Down When Just One Fails?

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Wed Feb 20 02:34:47 EST 2019


>>> Eric Robinson <eric.robinson at psmnv.com> wrote on 19.02.2019 at 21:06 in
message
<MN2PR03MB4845BE22FADA30B472174B79FA7C0 at MN2PR03MB4845.namprd03.prod.outlook.com>

>>  -----Original Message-----
>> From: Users <users-bounces at clusterlabs.org> On Behalf Of Ken Gaillot
>> Sent: Tuesday, February 19, 2019 10:31 AM
>> To: Cluster Labs - All topics related to open-source clustering welcomed
>> <users at clusterlabs.org>
>> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
>> Fails?
>> 
>> On Tue, 2019-02-19 at 17:40 +0000, Eric Robinson wrote:
>> > > -----Original Message-----
>> > > From: Users <users-bounces at clusterlabs.org> On Behalf Of Andrei
>> > > Borzenkov
>> > > Sent: Sunday, February 17, 2019 11:56 AM
>> > > To: users at clusterlabs.org 
>> > > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just
>> > > One Fails?
>> > >
>> > > 17.02.2019 0:44, Eric Robinson wrote:
>> > > > Thanks for the feedback, Andrei.
>> > > >
>> > > > I only want cluster failover to occur if the filesystem or drbd
>> > > > resources fail,
>> > > > or if the cluster messaging layer detects a complete node
>> > > > failure. Is there a way to tell Pacemaker not to trigger a
>> > > > cluster failover if any of the p_mysql resources fail?
>> > > >
>> > >
>> > > Let's look at this differently. If all these applications depend on
>> > > each other, you should not be able to stop an individual resource in
>> > > the first place - you need to group them or define a dependency so
>> > > that stopping any resource stops everything.
>> > >
>> > > If these applications are independent, they should not share
>> > > resources. Each MySQL application should have its own IP, its own
>> > > FS, and its own block device for that FS, so that they can be moved
>> > > between cluster nodes independently.
>> > >
>> > > Anything else will lead to trouble, as you have already observed.
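A sketch of the fully independent layout Andrei describes, using hypothetical resource names and pcs syntax (exact option spelling varies by pcs/Pacemaker version; this is one plausible shape, not the poster's actual configuration):

```shell
# One self-contained stack per MySQL instance; repeat for mysql002, mysql003, ...
pcs resource create p_drbd_mysql001 ocf:linbit:drbd \
    drbd_resource=mysql001 promotable promoted-max=1 clone-max=2
pcs resource create p_fs_mysql001 ocf:heartbeat:Filesystem \
    device=/dev/drbd1 directory=/mnt/mysql001 fstype=ext4
pcs resource create p_vip_mysql001 ocf:heartbeat:IPaddr2 \
    ip=10.0.0.101 cidr_netmask=24
pcs resource create p_mysql_001 ocf:heartbeat:mysql \
    config=/mnt/mysql001/my.cnf
pcs resource group add g_mysql001 p_fs_mysql001 p_vip_mysql001 p_mysql_001

# Tie the group to wherever this instance's DRBD device is promoted
pcs constraint order promote p_drbd_mysql001-clone then start g_mysql001
pcs constraint colocation add g_mysql001 with master p_drbd_mysql001-clone INFINITY
```

With one DRBD device, filesystem, and VIP per instance, a failure in one stack can move only that stack; the other instances stay where they are.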
>> >
>> > FYI, the MySQL services do not depend on each other. All of them
>> > depend on the floating IP, which depends on the filesystem, which
>> > depends on DRBD, but they do not depend on each other. Ideally, the
>> > failure of p_mysql_002 should not cause failure of other mysql
>> > resources, but now I understand why it happened. Pacemaker wanted to
>> > start it on the other node, so it needed to move the floating IP,
>> > filesystem, and DRBD primary, which had the cascade effect of stopping
>> > the other MySQL resources.
>> >
>> > I think I also understand why the p_vip_clust01 resource blocked.
>> >
>> > FWIW, we've been using Linux HA since 2006, originally Heartbeat, but
>> > then Corosync+Pacemaker. The past 12 years have been relatively
>> > problem free. This symptom is new for us, only within the past year.
>> > Our cluster nodes have many separate instances of MySQL running, so it
>> > is not practical to have that many filesystems, IPs, etc. We are
>> > content with the way things are, except for this new troubling
>> > behavior.
>> >
>> > If I understand the thread correctly, on-fail=stop will not work
>> > because the cluster will still try to stop the resources that are
>> > implied dependencies.
>> >
>> > Bottom line is, how do we configure the cluster in such a way that
>> > there are no cascading consequences when a MySQL resource fails?
>> > Basically, if a MySQL resource fails, it fails. We'll deal with that
>> > on an ad-hoc basis. I don't want the whole cluster to barf. What about
>> > on-fail=ignore? Earlier, you suggested symmetrical=false might also do
>> > the trick, but you said it comes with its own can of worms.
>> > What are the downsides of on-fail=ignore or symmetrical=false?
>> >
>> > --Eric
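A hypothetical reconstruction of the shared-dependency chain Eric describes (resource names modeled on those mentioned in the thread; the actual constraints are not shown in the original messages):

```shell
# Chain: DRBD (promoted) -> filesystem -> floating IP -> every p_mysql_NNN
pcs constraint colocation add p_mysql_001 with p_vip_clust01 INFINITY
pcs constraint colocation add p_mysql_002 with p_vip_clust01 INFINITY
pcs constraint order start p_vip_clust01 then start p_mysql_001
pcs constraint order start p_vip_clust01 then start p_mysql_002

# With this topology, recovering p_mysql_002 on the other node forces the
# VIP (and the filesystem and DRBD primary beneath it) to move, which in
# turn stops every other p_mysql_NNN -- the cascade described above.
```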
>> 
>> Even adding on-fail=ignore to the recurring monitors may not do what you
>> want, because I suspect that even an ignored failure will make the node
>> less preferable for all the other resources. But it's worth testing.
>> 
>> Otherwise, your best option is to remove all the recurring monitors from
>> the mysql resources and rely on external monitoring (e.g. Nagios, Icinga,
>> Monit, ...) to detect problems.
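Ken's two suggestions might look like this in pcs (resource name and monitor interval are hypothetical; check the real values with `pcs resource show` or `pcs resource config`, depending on your pcs version):

```shell
# Option 1: keep the monitor but ignore its failures (worth testing, per Ken)
pcs resource update p_mysql_002 op monitor interval=30s on-fail=ignore

# Option 2: drop the recurring monitor entirely; the cluster still starts
# and stops the resource on failover, it just no longer health-checks it
pcs resource op remove p_mysql_002 monitor interval=30s
```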
> 
> This is probably a dumb question, but can we remove just the monitor
> operation but leave the resource configured in the cluster? If a node
> fails over, we do want the resources to start automatically on the new
> primary node.

Actually, I wonder whether this makes sense at all: IMHO, a cluster ensures
that the phone does not ring at night calling me to perform recovery
operations after a failure. Once you move to manually starting and stopping
resources, I fail to see the reason for having a cluster.

When done well, independent resources should be configured (and managed)
independently; otherwise they are dependent. There is no "middle way".

Regards,
Ulrich

> 
>> --
>> Ken Gaillot <kgaillot at redhat.com>
>> 
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org 
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>> Project Home: http://www.clusterlabs.org 
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>> Bugs: http://bugs.clusterlabs.org 
