[ClusterLabs] DLM recovery stuck

Thu Aug 9 17:20:04 UTC 2018

On Thu, Aug 09, 2018 at 06:11:48PM +0200, Ferenc Wágner wrote:
> Hi David,
> 
> Almost ten years ago you requested more info in a similar case, let's
> see if we can get further now!

Hi, the usual cause is that a network message from the dlm has been
lost/dropped/missed.  The dlm can't recover from that, which is clearly a
weak point in the design.  There may be some new development coming along
to finally improve that.

One way you can confirm this is to check if the dlm on one or more nodes
is waiting for a message that's not arriving.  Often you'll see an entry
in the dlm "waiters" debugfs file corresponding to a response that's being
waited on.
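
As a rough sketch of that check, the following lists every non-empty
waiters file on a node.  It assumes debugfs is mounted at
/sys/kernel/debug and that the dlm exposes one "<lockspace>_waiters" file
per lockspace under the dlm directory there; the helper name
pending_waiters is made up for illustration, and the lines are printed
raw because the exact field layout can differ between kernel versions.

```python
from pathlib import Path

# Assumption: debugfs is mounted at /sys/kernel/debug and the dlm keeps
# one "<lockspace>_waiters" file per lockspace in this directory.
DLM_DEBUG_DIR = Path("/sys/kernel/debug/dlm")


def pending_waiters(debug_dir=DLM_DEBUG_DIR):
    """Return {lockspace: [raw waiter lines]} for non-empty waiters files."""
    result = {}
    for path in sorted(Path(debug_dir).glob("*_waiters")):
        lockspace = path.name[: -len("_waiters")]
        try:
            lines = [ln for ln in path.read_text().splitlines() if ln.strip()]
        except OSError:
            # File vanished or is unreadable (e.g. not running as root).
            continue
        if lines:
            result[lockspace] = lines
    return result


if __name__ == "__main__":
    # Hypothetical usage: run as root on each node while things are hung.
    for lockspace, lines in pending_waiters().items():
        print(f"{lockspace}: {len(lines)} waiter(s)")
        for line in lines:
            print("  " + line)
```

Running this on every node and comparing the output should show which
node is still waiting for a reply, and for which lockspace.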

Another red flag is kernel messages from a driver indicating some network
hiccup at the time things hung.  I can't say if these messages you sent
happened at the right time, or if they even correspond to the dlm
interface, but it's worth checking as a possible explanation:

[  137.207059] be2net 0000:05:00.0 enp5s0f0: Link is Up
[  137.252901] be2net 0000:05:00.1 enp5s0f1: Link is Up

[  153.886619]  connection2:0: detected conn error (1011)
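
To line such messages up with the time of the hang, a small filter over
saved dmesg output can help.  This is only a sketch: it assumes the
default "[seconds.micros]" timestamp prefix seen above, and the patterns
it searches for are simply the ones from the quoted lines, not a complete
list of what a driver might log.  The function name network_events and
the time window are illustrative.

```python
import re

# Matches the default dmesg prefix, e.g. "[  137.207059] message".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")
# Illustrative patterns only, taken from the log lines quoted above.
SUSPECT_RE = re.compile(r"Link is (Up|Down)|conn error", re.IGNORECASE)


def network_events(log_lines, start, end):
    """Return (timestamp, message) pairs for suspect lines in [start, end]."""
    events = []
    for line in log_lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # not a timestamped kernel log line
        timestamp = float(match.group(1))
        message = match.group(2)
        if start <= timestamp <= end and SUSPECT_RE.search(message):
            events.append((timestamp, message))
    return events


if __name__ == "__main__":
    import sys

    # Hypothetical usage: dmesg | python3 this_script.py 130 160
    start, end = float(sys.argv[1]), float(sys.argv[2])
    for timestamp, message in network_events(sys.stdin, start, end):
        print(f"{timestamp:12.6f}  {message}")
```

The window boundaries are seconds since boot, so they have to be derived
from when the hang was first noticed on that node.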