[ClusterLabs] Antw: Retries before setting fail-count to INFINITY

Mon Aug 21 10:54:10 EDT 2017

On Mon, 2017-08-21 at 15:39 +0200, Ulrich Windl wrote:
> >>> Vaibhaw Pandey <vabu.vayu at gmail.com> wrote on 21.08.2017 at 14:58 in
> message
> <CAAdwLTsZMX5fD=RsA7k1DKgMKoZ51A0jM=Hay4rUB4EF44Z7PA at mail.gmail.com>:
> > Version in use: 1.1 along with corosync 1.4
> > 
> > Hello,
> > I am new to pacemaker and was trying to setup a MySQL master/slave cluster
> > using pacemaker and had a question on resource failure response which I
> > couldn't resolve from the documentation.
> > 
> > The pacemaker doc (
> > https://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_failure_response.html)
> > says clearly that:
> > 
> > "Normally, if a running resource fails, pacemaker will try to stop it and
> > start it again."
> > 
> > I was wondering if there is a way to configure the # of times pacemaker
> > will attempt this start and stop sequence - we want to try and restart the
> > resource 2 or 3 times before it is stopped. Obviously setting a
> 
> Maybe you misunderstood: A stopped resource is the precondition for a successful start. So before any start attempt of a failed resource comes a stop attempt. If your monitor times out, try to increase the monitor timeout; if it causes false alerts, fix the monitor. If the database is crap, replace the database ;-)

Agreed, the ideal solution here is to fix the monitor. (The monitor
action is free to try 2 or 3 times before returning a result.)
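
As a rough illustration of retrying inside the monitor, here is a
minimal sketch in Python. It is not taken from any existing resource
agent: monitor_with_retries, check_once, attempts and delay_seconds are
hypothetical names, and check_once stands in for whatever single health
probe your agent already performs. The exit codes are the standard OCF
ones. The sketch assumes that only a generic error is worth retrying,
so a clean "not running" result is reported immediately.

```python
import time
from typing import Callable

# Standard OCF resource agent exit codes.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def monitor_with_retries(check_once: Callable[[], int],
                         attempts: int = 3,
                         delay_seconds: float = 2.0) -> int:
    """Call check_once up to `attempts` times.

    check_once is a hypothetical callable that performs one health
    probe and returns an OCF exit code. Only OCF_ERR_GENERIC is treated
    as possibly transient (assumption); any other result is returned
    at once.
    """
    result = OCF_ERR_GENERIC
    for attempt in range(1, attempts + 1):
        result = check_once()
        if result != OCF_ERR_GENERIC:
            return result
        # Pause before the next probe, but not after the last one.
        if attempt < attempts:
            time.sleep(delay_seconds)
    return result
```

Whatever values you pick, the worst case (every probe plus every delay)
has to finish inside the timeout configured for the monitor operation,
otherwise the cluster will still record a failure.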

FYI, there is a planned overhaul of pacemaker's failure handling that
would give this capability. The new options would allow you to say
"ignore this many failures, then try restarting this many times, then do
this hard recovery action". However, there's no time frame for when that
will arrive.

> > migration-threshold doesn't work in this case because the moment the 1st
> > attempt to restart the resource fails, fail-count is set to INFINITY. Our
> > failure-timeout is set to default (0).
> 
> Yes, the cluster cannot predict the future: If the resource failed to start, it's unlikely that repeating the same thing will suddenly succeed. It's more likely that the start will succeed elsewhere (disregarding configuration errors).
> 
> > 
> > The reason we wish to do this is that, at times the database is busy and
> > the monitor action fails. However there is a good chance it might succeed
> > on a second or third attempt.
> 
> By "it", do you mean the "monitor" operation?
> 
> > 
> > Is there a parameter in pacemaker that we can utilize to cause this
> > behavior or will this have to be coded in the resource agent?
> 
> See above.
> 
> > 
> > Thanks,
> > Vaibhaw
> 

-- 
Ken Gaillot <kgaillot at redhat.com>