[ClusterLabs] Node is silently unfenced if transition is very long

Tue Jun 21 12:24:06 EDT 2016

On 21/06/16 12:19 PM, Ken Gaillot wrote:
> On 06/17/2016 07:05 AM, Vladislav Bogdanov wrote:
>> 03.05.2016 01:14, Ken Gaillot wrote:
>>> On 04/19/2016 10:47 AM, Vladislav Bogdanov wrote:
>>>> Hi,
>>>>
>>>> I just found an issue where a node is silently unfenced.
>>>>
>>>> This is quite a large setup (2 cluster nodes and 8 remote ones) with
>>>> plenty of slowly starting resources (lustre filesystem).
>>>>
>>>> Fencing was initiated due to a resource stop failure.
>>>> lustre often starts very slowly due to internal recovery, and some such
>>>> resources were starting in the same transition in which another resource
>>>> failed to stop.
>>>> As the transition did not finish within the time specified by
>>>> "failure-timeout" (set to 9 min) and was not aborted, that stop
>>>> failure was successfully cleaned up.
>>>> There were transition aborts due to attribute changes after that
>>>> stop failure happened, but fencing was not initiated for some reason.
>>>
>>> Unfortunately, that makes sense with the current code. Failure timeout
>>> changes the node attribute, which aborts the transition, which causes a
>>> recalculation based on the new state, and the fencing is no longer
>>
>> Ken, could this one be considered to be fixed before 1.1.15 is released?
> 
> I'm planning to release 1.1.15 later today, and this won't make it in.
> 
> We do have several important open issues, including this one, but I
> don't want them to delay the release of the many fixes that are ready to
> go. I would only hold for a significant issue introduced this cycle, and
> none of the known issues appear to qualify.

I wonder if it would be worth appending a "known bugs/TODO" list to the
release announcements? Partly as a "heads-up" and partly as a way to
show folks what might be coming in .x+1.

>> I was just hit by the same issue in a completely different setup.
>> Two-node cluster: one node fails to stop a resource and is fenced.
>> Right after that, the second node fails to activate a clvm volume (different
>> story, need to investigate) and then fails to stop it. That node is scheduled
>> to be fenced, but it cannot be because the first node hasn't come up yet.
>> Any cleanup (automatic or manual) of a resource that failed to stop clears
>> the node state, removing the "unclean" state from the node. That is probably
>> not what I would expect (a resource cleanup acting as a node unfence)...
>> Honestly, this could potentially lead to data corruption...
>>
>> Also (probably not related), there was one more resource stop failure (a
>> timeout in that case) prior to the failed stop mentioned above. And that stop
>> timeout did not lead to fencing by itself.
>>
>> I have logs (but not pe-inputs/traces/blackboxes) from both nodes, so
>> any additional information from them can be easily provided.
>>
>> Best regards,
>> Vladislav

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?