[ClusterLabs] Problems with master/slave failovers

Thu Jul 4 23:58:00 EDT 2019

I would tend to agree with you on this matter, Andrei. To me it makes more sense for Pacemaker to prioritise maintaining a master over restarting a failed resource. If master scores are set in a sensible manner, the newly promoted master would immediately be given a high score, so the other instance coming back online later would not cause a second failover. It becomes more difficult to maintain the desired master scores if they depend on how long the failed resource takes to restart: it then becomes a matter of timing whether Pacemaker fails the resource over or just restarts it and promotes it again on the failed node. I'm pretty sure that's why I've been seeing the behaviour I've reported.

Regards,
Harvey

________________________________________
From: Users <users-bounces at clusterlabs.org> on behalf of Andrei Borzenkov <arvidjaar at gmail.com>
Sent: Wednesday, 3 July 2019 8:59 p.m.
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: EXTERNAL: Re: [ClusterLabs] Problems with master/slave failovers

On Wed, Jul 3, 2019 at 12:59 AM Ken Gaillot <kgaillot at redhat.com> wrote:
>
> On Mon, 2019-07-01 at 23:30 +0000, Harvey Shepherd wrote:
> > > The "transition summary" is just a resource-by-resource list, not
> > > the
> > > order things will be done. The "executing cluster transition"
> > > section
> > > is the order things are being done.
> >
> > Thanks Ken. I think that's where the problem is originating. If you
> > look at the "executing cluster transition" section, it's actually
> > restarting the failed king instance BEFORE promoting the remaining
> > in-service slave. When the failed resource comes back online, that
> > adjusts the master scores, resulting in the transition being aborted.
> > Both nodes then end up having the same master score for the king
> > resource, and Pacemaker decides to re-promote the original master. I
> > would have expected Pacemaker's priority to be to ensure that there
> > was a master available first, then to restart the failed instance in
> > slave mode. Is there a way to configure it to do that?
>
> No, that's intentional behavior. Starts are done before promotes so
> that promotion scores are in their final state before ultimately
> choosing the master.

There are applications that take tens of minutes to start while
failover is nearly instantaneous. Enforcing a slave restart before
promotion means an extended period of service unavailability. At the
very least this must be configurable.
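
To put rough numbers on the argument above, here is a minimal sketch of
the two orderings. It is a toy model, not Pacemaker's scheduler: the
function name, the single-surviving-slave assumption and the timings
(a 20-minute start, a 2-second promote) are all hypothetical, chosen
only to illustrate the difference.

```python
def outage_seconds(start_time_s: float, promote_time_s: float,
                   start_before_promote: bool) -> float:
    """Seconds without a master after the old master fails.

    Toy model: one surviving slave is ready to be promoted, and the
    failed instance has to be started again as a slave.
    """
    if start_before_promote:
        # Ordering described in this thread: the promote waits for the
        # failed instance to finish starting.
        return start_time_s + promote_time_s
    # Alternative ordering: promote the survivor right away and restart
    # the failed instance afterwards (or in parallel).
    return promote_time_s


# Hypothetical timings, for illustration only.
slow_start_s = 20 * 60
promote_s = 2

print(outage_seconds(slow_start_s, promote_s, start_before_promote=True))
print(outage_seconds(slow_start_s, promote_s, start_before_promote=False))
```

Under these assumed timings the first ordering leaves the service
without a master for the whole restart, while the second is bounded by
the promote time alone.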

> Otherwise, you'd end up in the same final
> situation, but the master would fail over first then fail back.
>

Now, really - while a resource agent is of course free to throw dice
to decide master scores, in real life, in all cases I am familiar
with, the master score is decided by the underlying application state.
If an agent comes up and sees another instance running as master, it
is highly unlikely that the agent will voluntarily force the master
away. And if that happens, I'd say the agent is buggy and it is not
Pacemaker's job to work around it.
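
As an illustration of a score policy driven by application state, the
following is a minimal sketch of how an agent might pick its score. The
function, its parameters and the score values (100, 50, 10) are
hypothetical and not taken from any real agent; a real agent would
derive these inputs from the application and publish the result
through Pacemaker's own tooling.

```python
from typing import Optional


def promotion_score(healthy: bool, is_master: bool,
                    peer_is_master: bool) -> Optional[int]:
    """Pick a master score from application state.

    Returns None when the instance should not be promotable at all.
    All score values are arbitrary illustrative numbers.
    """
    if not healthy:
        return None   # an unhealthy instance must never be promoted
    if is_master:
        return 100    # the current master keeps the highest score
    if peer_is_master:
        return 10     # a returning instance does not contest the master
    return 50         # no master anywhere: available for promotion


# A restarted slave that sees a running master scores below it, so its
# return should not trigger a second failover.
assert promotion_score(True, False, True) < promotion_score(True, True, False)
```

With a policy of this shape, the instance that comes back never
outbids the one that was promoted in its absence.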

> It's up to the agent to set master scores in whatever fashion it
> considers ideal.
>

Except Pacemaker makes a started resource a prerequisite for it. In
real life it may not even be possible to start the former master
before reconfiguring it.