[ClusterLabs Developers] Master scores ignored during cluster startup
Jehan-Guillaume de Rorthais
jgdr at dalibo.com
Mon Sep 18 09:37:35 EDT 2017
On Fri, 15 Sep 2017 14:32:49 -0500
Ken Gaillot <kgaillot at redhat.com> wrote:
> On Tue, 2017-08-29 at 16:50 -0500, Ken Gaillot wrote:
> > On Tue, 2017-08-29 at 22:55 +0200, Jehan-Guillaume de Rorthais wrote:
> > > Hi all,
> > >
> > > We discussed this issue with Ken Gaillot and Lars Ellenberg today on IRC.
> > > This is just a summary of the problem.
> > >
> > > Some users reported to me that the PAF RA was not promoting the master
> > > after a cluster startup. The problem is that PAF sets master scores with
> > > "-l forever", not transient ones that disappear on cluster reboot
> > > ("-l reboot"). It expects the master scores to be present on cluster start.
> > >
> > > So basically, on cluster startup, the master score is just ignored and no
> > > slave is promoted... until the next cluster recheck (by default, 15 minutes
> > > later).
> > >
> > > You can reproduce the issue using the RA attached to this email and the
> > > following scenario:
> > >
> > > pcs cluster setup --name cluster_demo srv1 srv2
> > > pcs cluster start --all
> > > pcs cluster cib cluster1.xml
> > > pcs -f cluster1.xml property set stonith-enabled=false
> > > pcs -f cluster1.xml resource create stateful stateful-custom \
> > > op monitor interval=10s role="Master" \
> > > op monitor interval=11s role="Slave"
> > > pcs -f cluster1.xml resource master stateful-ha stateful
> > > pcs cluster cib-push cluster1.xml
> > > crm_master -r stateful -l forever -v 1
> > > # the master is elected and set its score to 10
> > > pcs cluster stop --all
> > > pcs cluster start --all
> > > # no master elected despite the master score
> > >
> > > Our discussion led to the theory that this might be related to this commit,
> > > where the master score is ignored for a resource if it is not started yet:
> > > https://github.com/beekhof/pacemaker/commit/65f1a22a4b66581159d8b747dbd49fa5e2ef34e1
> > >
> > > If this is the actual issue, I suppose this should be improved to detect
> > > whether another master exists before ignoring the master score. Or maybe to
> > > detect whether we are in a cluster startup or enabling the stateful resource
> > > in the cluster.
> > >
> > > Thoughts?
> >
> > The issue has turned out to be more complicated, and unrelated to that
> > commit. I see two problems: clone_name is NULL when master_score() is
> > called, so the code looks for master-whatever:0 rather than
> > master-whatever; and startup triggers the "has been filtered" block of
> > code in master_score(), so we don't even get as far as checking
> > clone_name.
> >
> > My instinct for the second issue is that we should be able to skip the
> > "has been filtered" short-circuit if the resource isn't running or
> > probed *anywhere* (as opposed to the particular node being checked). I
> > haven't tracked down the first issue yet.
>
> This has been fixed in the upstream master branch.
>
> Commit 905ddd6 handles the second issue as described above. Commit
> dd5e271d handles the first issue by checking the master score using the
> resource name without the clone instance number if the resource has no
> LRM history. There's a new regression test to verify the correct
> behavior.
Thank you Ken for the fixes!
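
For anyone following the thread, the name fallback Ken describes for the
first issue can be sketched roughly as below. This is only an illustration
of the idea, not Pacemaker's actual code (which is C): the helper names
strip_clone_instance and master_attr_name, and the yes/no "has LRM history"
flag, are assumptions made up for this example.

```shell
#!/bin/sh
# Illustrative sketch only; not taken from the Pacemaker source.

# Strip a trailing ":<number>" clone instance suffix, if there is one.
strip_clone_instance() {
    name=$1
    suffix=${name##*:}
    case "$name" in
        *:*)
            case "$suffix" in
                ''|*[!0-9]*) ;;             # suffix is not a plain number: keep as-is
                *) name=${name%:*} ;;       # drop ":<number>"
            esac
            ;;
    esac
    printf '%s\n' "$name"
}

# Hypothetical helper: print the master score attribute name to look up.
#   $1: resource instance name, e.g. "stateful:0"
#   $2: "yes" if the instance has LRM history, "no" otherwise
master_attr_name() {
    rsc=$1
    has_history=$2
    if [ "$has_history" = "no" ]; then
        # No history yet (e.g. right after cluster startup): use the
        # resource name without the clone instance number.
        rsc=$(strip_clone_instance "$rsc")
    fi
    printf 'master-%s\n' "$rsc"
}

master_attr_name "stateful:0" no
master_attr_name "stateful:0" yes
```

With no history the lookup falls back to master-stateful, which is the
name a score set with "crm_master -r stateful -l forever" would be stored
under in the scenario above; with history it keeps the instance-qualified
name.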