[ClusterLabs] failed resource resurrection - failcount/cleanup etc.?

Thu Jul 11 09:16:20 EDT 2019

On Thu, 2019-07-11 at 10:39 +0100, lejeczek wrote:
> On 10/07/2019 15:50, Ken Gaillot wrote:
> > On Wed, 2019-07-10 at 11:26 +0100, lejeczek wrote:
> > > hi guys, possibly @devel if they pop in here.
> > > 
> > > Is there, or will there be, a way to make the cluster keep trying
> > > to recover failed resources rather than giving up on them?
> > > 
> > > I understand that as of now the only way to bring a failed
> > > resource back up is the user's manual intervention (under which
> > > I'd include any scripted approach outside of the cluster).
> > > 
> > > many thanks, L.
> > 
> > Not sure what you mean ... the default behavior is to try restarting
> > a failed resource up to 1,000,000 times on the same node, then try
> > starting it on a different node, and not give up until all nodes have
> > failed to start it.
> > 
> > This is affected by on-fail, migration-threshold, failure-timeout,
> > and start-failure-is-fatal.
> > 
> > If you're talking about a resource that failed because the entire
> > node failed, then fencing comes into play.
> 
> Apologies, I was not clear enough in wording my question; I see that
> now. When I said "make cluster deal with failed resources" I meant a
> resource which has failed in the whole cluster, i.e. failed on every
> node.
> 
> If that happens, I see that only my manual intervention can make the
> cluster look at the resource again. I wonder whether I am simply
> unaware of a way for the cluster to do something by itself, without
> needing me, and not give up.
> 
> My case is: a systemd resource whose success or failure is determined
> by a mechanism outside of the cluster; it can only start successfully
> on one single node. When that node reboots, the cluster fails this
> resource, and when that node has rebooted and is up again, the failed
> resource remains in the failed state.
> 
> Hopefully I managed to make it a bit clearer this time.
> 
> Many thanks, L.

Ah, yes. failure-timeout is the only way to handle that. Keep in mind
it is not guaranteed to be checked more frequently than the
cluster-recheck-interval.
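
To illustrate how the two settings interact, here is a rough sketch of
the timing, not Pacemaker code. It assumes a simplified model in which
the cluster re-evaluates only at fixed multiples of the recheck interval
and nothing else triggers a re-evaluation in between; the function name,
its parameters, and the example values are all made up for illustration.

```python
import math


def earliest_retry_time(last_failure, failure_timeout, recheck_interval,
                        first_recheck=0):
    """Return when an expired failure would first be noticed.

    Simplified model (an assumption, not Pacemaker's actual scheduler):
    rechecks happen at first_recheck, first_recheck + recheck_interval,
    first_recheck + 2 * recheck_interval, and so on. All values are in
    seconds on the same clock.
    """
    # The failure stops counting once failure_timeout has elapsed.
    expiry = last_failure + failure_timeout
    if expiry <= first_recheck:
        return first_recheck
    # Round up to the next periodic recheck at or after the expiry.
    ticks = math.ceil((expiry - first_recheck) / recheck_interval)
    return first_recheck + ticks * recheck_interval


if __name__ == "__main__":
    # Failure at t=100s, failure-timeout of 60s, rechecks every 900s:
    # the failure expires at t=160s but is only noticed at t=900s.
    print(earliest_retry_time(100, 60, 900))
```

In this model a short failure-timeout does not help much on its own:
the resource is retried only at the next recheck, so the recheck
interval sets the worst-case delay.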
-- 
Ken Gaillot <kgaillot at redhat.com>