[ClusterLabs] failed resource resurection - failcount/cleanup etc ?
peljasz at yahoo.co.uk
Fri Jul 12 08:33:14 EDT 2019
On 11/07/2019 14:16, Ken Gaillot wrote:
> On Thu, 2019-07-11 at 10:39 +0100, lejeczek wrote:
>> On 10/07/2019 15:50, Ken Gaillot wrote:
>>> On Wed, 2019-07-10 at 11:26 +0100, lejeczek wrote:
>>>> hi guys, possibly @devel if they pop in here.
>>>> is there, or will there be, a way to make the cluster deal with failed
>>>> resources in such a way that the cluster would try not to give up on
>>>> them? I understand that as of now the only way is the user's manual
>>>> intervention (under which I'd include any scripted ways outside of
>>>> the cluster) if one needs to bring back up a failed resource.
>>>> many thanks, L.
>>> Not sure what you mean ... the default behavior is to try restarting a
>>> failed resource up to 1,000,000 times on the same node, then try
>>> starting it on a different node, and not give up until all nodes have
>>> failed to start it.
>>> This is affected by on-fail, migration-threshold, failure-timeout, etc.
>>> If you're talking about a resource that failed because the entire node
>>> failed, then fencing comes into play.
>> Apologies, I was not clear enough while wording my question, I see
>> that now. When I said - make the cluster deal with failed resources - I
>> meant a resource which failed in the (whole) cluster, failed on every
>> node. If that happens I see that only my (user manual) intervention can
>> make the cluster look at the resource again, and I wonder if this is me
>> unaware that there are ways it can be done, that the cluster will not
>> need me and would do something itself, will not give up.
>> My case is: a systemd resource whose success or failure is determined
>> by a mechanism outside of the cluster; it can only successfully start
>> on one single node. When that node reboots, the cluster fails this
>> resource, and when that node has rebooted and is up again the failed
>> resource remains in the failed state.
>> Hopefully I managed to make it a bit clearer this time.
>> Many thanks, L.
> Ah, yes. failure-timeout is the only way to handle that. Keep in mind
> it is not guaranteed to be checked more frequently than the
> cluster-recheck-interval.
Is "cluster-recheck-interval" tough on the cluster? Is it okay to take
it down from the default of 15 min?
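
For anyone finding this thread later, a minimal sketch of the two
settings discussed above, using pcs (the resource name "myapp" is made
up for illustration; pick values suited to your own cluster):

    # Let a failed resource's failcount expire after 10 minutes, so the
    # cluster will eventually consider starting it again on its own:
    pcs resource update myapp meta failure-timeout=10min

    # Expired failcounts are only acted on when the cluster re-evaluates
    # its state, so the recheck interval bounds how soon the retry can
    # actually happen:
    pcs property set cluster-recheck-interval=5min

With both set, a resource that failed everywhere can be retried without
a manual "pcs resource cleanup", at the cost of the cluster re-running
the scheduler more often.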