[ClusterLabs] failed resource resurrection - failcount/cleanup etc ?

Fri Jul 12 08:33:14 EDT 2019

On 11/07/2019 14:16, Ken Gaillot wrote:
> On Thu, 2019-07-11 at 10:39 +0100, lejeczek wrote:
>> On 10/07/2019 15:50, Ken Gaillot wrote:
>>> On Wed, 2019-07-10 at 11:26 +0100, lejeczek wrote:
>>>> hi guys, possibly @devel if they pop in here.
>>>>
>>>> Is there, or will there be, a way to make the cluster deal with failed
>>>> resources in such a way that it would try not to give up on them?
>>>>
>>>> I understand that, as of now, the only way to bring a failed resource
>>>> back up is the user's manual intervention (under which I'd include any
>>>> scripted ways outside of the cluster).
>>>>
>>>> many thanks, L.
>>> Not sure what you mean ... the default behavior is to try restarting a
>>> failed resource up to 1,000,000 times on the same node, then try
>>> starting it on a different node, and not give up until all nodes have
>>> failed to start it.
>>>
>>> This is affected by on-fail, migration-threshold, failure-timeout, and
>>> start-failure-is-fatal.
>>>
>>> If you're talking about a resource that failed because the entire node
>>> failed, then fencing comes into play.
>> Apologies, I was not clear enough in wording my question; I see that
>> now. When I said "make the cluster deal with failed resources", I meant
>> a resource which has failed in the whole cluster, i.e. failed on every
>> node.
>>
>> If that happens, I see that only my manual intervention can make the
>> cluster look at the resource again. I wonder whether I am simply unaware
>> of a way for the cluster to retry by itself, without needing me, instead
>> of giving up.
>>
>> My case is a systemd resource whose success or failure is determined by
>> a mechanism outside of the cluster; it can only start successfully on
>> one single node. When that node reboots, the cluster fails this
>> resource, and when the node is up again the resource remains in the
>> failed state.
>>
>> Hopefully I managed to make it a bit clearer this time.
>>
>> Many thanks, L.
> Ah, yes. failure-timeout is the only way to handle that. Keep in mind
> it is not guaranteed to be checked more frequently than the
> cluster-recheck-interval.

fantastic!

Is "cluster-recheck-interval" tough on the cluster? Is it okay to take it
down from the default 15min?
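
To check that I read the timing right, here is a rough sketch of how I
understand the two settings to interact. It is only my simplified model,
not anything taken from Pacemaker itself: I assume a failure expires
after failure-timeout, and that in a quiet cluster the expiry is only
noticed at the next recheck, so the worst case is roughly the sum of the
two. The function name and the 60s timeout are made up for illustration.

```python
from datetime import timedelta


def worst_case_expiry(failure_timeout: timedelta,
                      recheck_interval: timedelta) -> timedelta:
    """Rough upper bound on how long a failure can linger.

    Assumption (simplified model): the failure becomes eligible to expire
    after failure_timeout, but nothing re-evaluates it until the next
    recheck, which may be up to one full recheck_interval later.
    """
    return failure_timeout + recheck_interval


# Hypothetical failure-timeout, chosen only for the example.
timeout = timedelta(seconds=60)

# With the default cluster-recheck-interval of 15min mentioned above.
print(worst_case_expiry(timeout, timedelta(minutes=15)))  # 0:16:00

# With a lowered interval of 2min (again just an example value).
print(worst_case_expiry(timeout, timedelta(minutes=2)))   # 0:03:00
```

If that model is about right, lowering the interval is what actually
shortens the wait, which is why I am asking about its cost.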

thanks, L.
