[ClusterLabs] Node is silently unfenced if transition is very long

Fri Jun 17 12:05:48 UTC 2016

03.05.2016 01:14, Ken Gaillot wrote:
> On 04/19/2016 10:47 AM, Vladislav Bogdanov wrote:
>> Hi,
>>
>> I just found an issue where a node is silently unfenced.
>>
>> This is quite a large setup (2 cluster nodes and 8 remote ones) with
>> plenty of slowly starting resources (Lustre filesystem).
>>
>> Fencing was initiated due to a resource stop failure.
>> Lustre often starts very slowly due to internal recovery, and some such
>> resources were starting in the same transition in which another resource
>> failed to stop. As the transition did not finish within the time specified
>> by "failure-timeout" (set to 9 min) and was not aborted, that stop failure
>> was cleaned up. There were transition aborts due to attribute changes
>> after the stop failure happened, but fencing was not initiated for some
>> reason.
>
> Unfortunately, that makes sense with the current code. Failure timeout
> changes the node attribute, which aborts the transition, which causes a
> recalculation based on the new state, and the fencing is no longer

Ken, could a fix for this be considered before 1.1.15 is released?
I was just hit by the same issue in a completely different setup.
It is a two-node cluster: one node fails to stop a resource and is fenced.
Right after that, the second node fails to activate a clvm volume (a
different story, I need to investigate) and then fails to stop it. That
node is scheduled to be fenced, but it cannot be, because the first node
has not come up yet. Any cleanup (automatic or manual) of the resource
that failed to stop clears the node state, removing the "unclean" state
from the node. That is not what I would expect (a resource cleanup acts
as a node unfence)...
Honestly, this can potentially lead to data corruption...
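
The effect described above can be illustrated with a toy model. This is
only a sketch and not Pacemaker code: the Node class, the function names
and the "fencing_pending" flag are all hypothetical, introduced purely for
illustration. It contrasts a node state that is derived only from the
currently recorded stop failures (so a cleanup or failure-timeout expiry
silently drops the need to fence) with a sticky flag that only a
completed fencing operation may clear.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for a cluster node (hypothetical, not a Pacemaker type)."""
    name: str
    failed_stops: set = field(default_factory=set)  # resources whose stop failed
    fencing_pending: bool = False  # hypothetical sticky flag


def record_stop_failure(node: Node, resource: str) -> None:
    """A stop failure is recorded and the node is marked as needing fencing."""
    node.failed_stops.add(resource)
    node.fencing_pending = True


def cleanup(node: Node, resource: str) -> None:
    """Models a manual cleanup or a failure-timeout expiry: the failure
    record is forgotten, but the sticky flag is deliberately left alone."""
    node.failed_stops.discard(resource)


def fence(node: Node) -> None:
    """Models a successful fencing: the only event that clears the flag."""
    node.failed_stops.clear()
    node.fencing_pending = False


def needs_fencing_derived(node: Node) -> bool:
    """Behaviour as reported: 'unclean' follows the current failure records."""
    return bool(node.failed_stops)


def needs_fencing_sticky(node: Node) -> bool:
    """Alternative: 'unclean' persists until the node is actually fenced."""
    return node.fencing_pending


node = Node("node2")
record_stop_failure(node, "clvm-volume")
cleanup(node, "clvm-volume")  # fencing has not happened yet
print(needs_fencing_derived(node))  # False: the node is silently "unfenced"
print(needs_fencing_sticky(node))   # True: fencing is still owed
```

In this model the derived check loses the pending fencing as soon as the
failure record is removed, which matches the behaviour seen in both
setups; whether a sticky state is the right fix for the real code is an
open question.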

Also (probably not related), there was one more resource stop failure (in
that case a timeout) prior to the failed stop mentioned above, and that
stop timeout did not lead to fencing by itself.

I have logs (but not pe-inputs/traces/blackboxes) from both nodes, so 
any additional information from them can be easily provided.

Best regards,
Vladislav