[ClusterLabs] Node is silently unfenced if transition is very long

Tue Jun 21 16:19:58 UTC 2016

On 06/17/2016 07:05 AM, Vladislav Bogdanov wrote:
> 03.05.2016 01:14, Ken Gaillot wrote:
>> On 04/19/2016 10:47 AM, Vladislav Bogdanov wrote:
>>> Hi,
>>>
>>> Just found an issue with node is silently unfenced.
>>>
>>> That is quite large setup (2 cluster nodes and 8 remote ones) with
>>> a plenty of slowly starting resources (lustre filesystem).
>>>
>>> Fencing was initiated due to resource stop failure.
>>> lustre often starts very slowly due to internal recovery, and some such
>>> resources were starting in that transition where another resource
>>> failed to stop.
>>> And, as transition did not finish in time specified by the
>>> "failure-timeout" (set to 9 min), and was not aborted, that stop
>>> failure was successfully cleaned.
>>> There were transition aborts due to attribute changes, after that
>>> stop failure happened, but fencing
>>> was not initiated for some reason.
>>
>> Unfortunately, that makes sense with the current code. Failure timeout
>> changes the node attribute, which aborts the transition, which causes a
>> recalculation based on the new state, and the fencing is no longer
> 
> Ken, could this one be considered to be fixed before 1.1.15 is released?

I'm planning to release 1.1.15 later today, and this won't make it in.

We do have several important open issues, including this one, but I
don't want them to delay the release of the many fixes that are ready to
go. I would only hold for a significant issue introduced this cycle, and
none of the known issues appear to qualify.

> I was just hit by the same in the completely different setup.
> Two-node cluster, one node fails to stop a resource, and is fenced.
> Right after that second node fails to activate clvm volume (different
> story, need to investigate) and then fails to stop it. Node is scheduled
> to be fenced, but it cannot be because first node didn't come up yet.
> Any cleanup (automatic or manual) of a resource failed to stop clears
> node state, removing "unclean" state from a node. That is probably not
> what I could expect (resource cleanup is a node unfence)...
> Honestly, this potentially leads to a data corruption...
> 
> Also (probably not related) there was one more resource stop failure (in
> that case - timeout) prior to failed stop mentioned above. And that stop
> timeout did not lead to fencing by itself.
> 
> I have logs (but not pe-inputs/traces/blackboxes) from both nodes, so
> any additional information from them can be easily provided.
> 
> Best regards,
> Vladislav