[ClusterLabs] DLM recovery stuck

Thu Aug 9 15:30:05 EDT 2018

> If you mean dlm/clvmd_waiters, it's empty on all nodes.  Is there
> anything else to check?

I guess that might be the wrong thing to look at when it's recovery that's
blocked; my memory about this isn't great.  I think the clues to check for
recovery are mainly the dlm kernel messages and maybe these sysfs files
(foo stands for the lockspace name, e.g. clvmd):

  /sys/kernel/dlm/foo/recover_status
  (flags may indicate which message is being waited for)

  /sys/kernel/dlm/foo/recover_nodeid
  (which node a reply is needed from)
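
A rough sketch of collecting those two files on a node, in case it helps.
It assumes every subdirectory of /sys/kernel/dlm is a lockspace; the
function name and the base parameter are made up for illustration, and the
values are returned as raw strings without trying to decode the flags.

```python
import os

def read_recovery_state(base="/sys/kernel/dlm"):
    """Return {lockspace: {attr: raw contents}} for the two recovery files."""
    state = {}
    for name in sorted(os.listdir(base)):
        lockspace_dir = os.path.join(base, name)
        if not os.path.isdir(lockspace_dir):
            continue  # assumption: only directories are lockspaces
        values = {}
        for attr in ("recover_status", "recover_nodeid"):
            try:
                with open(os.path.join(lockspace_dir, attr)) as f:
                    values[attr] = f.read().strip()
            except OSError:
                values[attr] = None  # file missing or unreadable
        state[name] = values
    return state
```

Calling read_recovery_state() on each node and comparing the results side
by side should show which node each one is waiting on.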

To eliminate userspace dlm_controld problems, look at dlm_controld debug
logs on each node and line up these steps from each of them:

clvmd check_ringid cluster 3724               (ringid needs to match)
clvmd start_kernel cg <N> member_count 6      (<N> will be different)
write "1" to "/sys/kernel/dlm/clvmd/control"
write "0" to "/sys/kernel/dlm/clvmd/event_done"
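
Something like the following could pull those steps out of each node's
debug log so they can be compared.  This is only a sketch: it matches on
plain substrings taken from the lines above, the function names are
hypothetical, and it assumes you have already read each node's
dlm_controld debug log into a list of lines.

```python
# Substrings that identify the four steps listed above.
STEPS = ("check_ringid", "start_kernel", "/control", "/event_done")

def extract_steps(lines, lockspace="clvmd"):
    """Return log lines that mention the lockspace and one of the steps, in order."""
    found = []
    for line in lines:
        if lockspace in line and any(step in line for step in STEPS):
            found.append(line.strip())
    return found

def line_up(logs_by_node, lockspace="clvmd"):
    """Map node -> extracted steps, for side-by-side comparison."""
    return {node: extract_steps(lines, lockspace)
            for node, lines in logs_by_node.items()}

def ringids(logs_by_node, lockspace="clvmd"):
    """Map node -> last ringid seen in a check_ringid line (None if absent)."""
    result = {}
    for node, lines in logs_by_node.items():
        ringid = None
        for line in lines:
            words = line.split()
            if lockspace in words and "check_ringid" in words and "cluster" in words:
                idx = words.index("cluster")
                if idx + 1 < len(words):
                    ringid = words[idx + 1]  # token following "cluster"
        result[node] = ringid
    return result
```

If ringids() returns more than one distinct value across nodes, or a node
is missing one of the four steps, that points at userspace rather than
the kernel.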

After this, follow the dlm kernel recovery messages, lining up the same
steps in parallel from each node.  The point at which they stop is the
recovery stage where a message didn't get through.  You can probably work
out which message between which nodes based on the sysfs files above.
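
To find where each node stops, a small helper along these lines might do.
It is a sketch under one assumption: that the dlm kernel messages for a
lockspace contain the text "dlm: <lockspace>:", which you should check
against your own logs.  The function names are hypothetical, and the
input is each node's kernel log as a list of lines.

```python
def last_dlm_message(kernel_lines, lockspace="clvmd"):
    """Return the last kernel log line for the lockspace, or None if there is none."""
    marker = "dlm: " + lockspace + ":"  # assumed message prefix
    last = None
    for line in kernel_lines:
        if marker in line:
            last = line.strip()
    return last

def stopping_points(kernel_logs_by_node, lockspace="clvmd"):
    """Map node -> last dlm kernel message seen for the lockspace."""
    return {node: last_dlm_message(lines, lockspace)
            for node, lines in kernel_logs_by_node.items()}
```

The node whose last message is from an earlier stage than the others is
the one to compare against recover_status and recover_nodeid.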