[ClusterLabs] Node is silently unfenced if transition is very long

Vladislav Bogdanov bubble at hoster-ok.com
Fri Jun 17 10:23:50 EDT 2016


17.06.2016 15:05, Vladislav Bogdanov wrote:
> 03.05.2016 01:14, Ken Gaillot wrote:
>> On 04/19/2016 10:47 AM, Vladislav Bogdanov wrote:
>>> Hi,
>>>
>>> Just found an issue where a node is silently unfenced.
>>>
>>> It is quite a large setup (2 cluster nodes and 8 remote ones) with
>>> plenty of slowly starting resources (lustre filesystem).
>>>
>>> Fencing was initiated due to a resource stop failure.
>>> lustre often starts very slowly due to internal recovery, and some such
>>> resources were starting in the same transition in which another resource
>>> failed to stop.
>>> As the transition did not finish within the time specified by
>>> "failure-timeout" (set to 9 min) and was not aborted, the stop
>>> failure was successfully cleaned.
>>> There were transition aborts due to attribute changes after the
>>> stop failure happened, but fencing was not initiated for some reason.
>>
>> Unfortunately, that makes sense with the current code. Failure timeout
>> changes the node attribute, which aborts the transition, which causes a
>> recalculation based on the new state, and the fencing is no longer
>
> Ken, could a fix for this be considered before 1.1.15 is released?

I created https://github.com/ClusterLabs/pacemaker/pull/1072 for this.
It is an RFC, tested only to the extent that it compiles.
I hope it is correct; please tell me if I am doing something damn
wrong, or if there is a better way.
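
For anyone following the thread, the race described above can be sketched
as a toy model. This is NOT Pacemaker code and not necessarily what the
pull request does: StopFailure, needs_fencing() and needs_fencing_sticky()
are hypothetical names made up for illustration, and the timestamps are
invented. The only value taken from the report is the 9 minute
"failure-timeout".

```python
from dataclasses import dataclass
from typing import Optional

# "failure-timeout" from the report: 9 minutes, expressed in seconds.
FAILURE_TIMEOUT = 9 * 60


@dataclass
class StopFailure:
    """Hypothetical record of a failed stop operation on a node."""
    node: str
    failed_at: int  # seconds since the start of the scenario


def failure_expired(failure: StopFailure, now: int,
                    timeout: int = FAILURE_TIMEOUT) -> bool:
    """True once the failure is older than the failure timeout."""
    return now - failure.failed_at >= timeout


def needs_fencing(failure: Optional[StopFailure], now: int,
                  fenced: bool) -> bool:
    """Naive recalculation: only an unexpired stop failure requires fencing."""
    if failure is None or fenced:
        return False
    return not failure_expired(failure, now)


def needs_fencing_sticky(failure: Optional[StopFailure], now: int,
                         fenced: bool) -> bool:
    """Assumed alternative: a stop failure keeps fencing pending until the
    node has actually been fenced, regardless of failure expiry."""
    return failure is not None and not fenced


# Invented timeline: the stop fails at t=60s; the transition is still busy
# with slow starts when it is aborted and recalculated at t=700s.
failure = StopFailure(node="node1", failed_at=60)
early = needs_fencing(failure, now=120, fenced=False)
late = needs_fencing(failure, now=700, fenced=False)
late_sticky = needs_fencing_sticky(failure, now=700, fenced=False)
```

In this model the early recalculation still wants fencing, while the late
one does not, because the failure expired before the fencing was executed.
The "sticky" variant shows one conceivable way to keep the fencing pending.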

Best,
Vladislav




