[ClusterLabs] failed resource resurrection - failcount/cleanup etc ?

Fri Jul 12 09:58:42 EDT 2019

On Fri, 2019-07-12 at 13:33 +0100, lejeczek wrote:
> On 11/07/2019 14:16, Ken Gaillot wrote:
> > On Thu, 2019-07-11 at 10:39 +0100, lejeczek wrote:
> > > On 10/07/2019 15:50, Ken Gaillot wrote:
> > > > On Wed, 2019-07-10 at 11:26 +0100, lejeczek wrote:
> > > > > hi guys, possibly @devel if they pop in here.
> > > > > 
> > > > > is there, will there be, a way to make cluster deal with failed
> > > > > resources in such a way that cluster would try not to give up on
> > > > > failed resources?
> > > > > 
> > > > > I understand that as of now the only way is user's manual
> > > > > intervention (under which I'd include any scripted ways outside
> > > > > of the cluster) if we need to bring back up a failed resource.
> > > > > 
> > > > > many thanks, L.
> > > > 
> > > > Not sure what you mean ... the default behavior is to try
> > > > restarting a failed resource up to 1,000,000 times on the same
> > > > node, then try starting it on a different node, and not give up
> > > > until all nodes have failed to start it.
> > > > 
> > > > This is affected by on-fail, migration-threshold,
> > > > failure-timeout, and start-failure-is-fatal.
> > > > 
> > > > If you're talking about a resource that failed because the entire
> > > > node failed, then fencing comes into play.
> > > 
> > > Apologies, I was not clear enough when wording my question; I see
> > > that now. When I said "make cluster deal with failed resources" I
> > > meant a resource which failed in the (whole) cluster, failed on
> > > every node.
> > > 
> > > If that happens, I see that only my (manual) intervention can make
> > > the cluster look at the resource again, and I wonder whether I am
> > > simply unaware of a way for the cluster to retry by itself, without
> > > needing me, rather than giving up.
> > > 
> > > My case is a systemd resource whose success is determined by a
> > > mechanism outside of the cluster; it can only start successfully on
> > > one single node. When that node reboots, the cluster fails this
> > > resource, and when the node is up again the resource remains in
> > > the failed state.
> > > 
> > > Hopefully I managed to make it a bit clearer this time.
> > > 
> > > Many thanks, L.
> > 
> > Ah, yes. failure-timeout is the only way to handle that. Keep in mind
> > it is not guaranteed to be checked more frequently than the
> > cluster-recheck-interval.
> 
> fantastic!
> 
> Is "cluster-recheck-interval" tough on the cluster? Is it okay to take
> it down from the default 15min?
> 
> thanks, L.

Certainly 5min is fine. I've seen users take it down as far as 1min,
although that makes me uneasy for no defined reason. It's not a lot of
overhead -- you can run "time crm_simulate -SL" to get an idea of what
it takes (plus increasing logs somewhat).
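
For reference, a minimal sketch of how the two settings discussed in
this thread might be applied, assuming the cluster is managed with pcs
(the thread does not say which tool is in use). The resource name
"my-rsc" and the values 120s and 5min are placeholders, not
recommendations. By default the script only prints the commands; the
RUN variable and the set_failure_recovery helper are illustrative names
introduced here.

```shell
#!/bin/sh
# Sketch only: "my-rsc", "120s" and "5min" are placeholder values,
# not taken from this thread. Assumes pcs is the management tool.

# Prefix for each command: "echo" prints it (dry run).
# Export RUN="" before running to execute the commands for real.
RUN="${RUN-echo}"

set_failure_recovery() {
    rsc="$1"; failure_timeout="$2"; recheck="$3"
    # Let the resource's recorded failures expire once failure_timeout
    # has passed without a new failure.
    $RUN pcs resource meta "$rsc" "failure-timeout=$failure_timeout"
    # Ask the cluster to re-evaluate at least this often, so that an
    # expired failure is noticed without manual cleanup.
    $RUN pcs property set "cluster-recheck-interval=$recheck"
}

set_failure_recovery my-rsc 120s 5min
```

Since failure-timeout is not guaranteed to be checked more often than
the cluster-recheck-interval, a retry could come noticeably later than
the failure-timeout alone would suggest.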
-- 
Ken Gaillot <kgaillot at redhat.com>