[ClusterLabs] DLM recovery stuck
David Teigland
teigland at redhat.com
Thu Aug 9 15:30:05 EDT 2018
> If you mean dlm/clvmd_waiters, it's empty on all nodes. Is there
> anything else to check?
I guess that might be the wrong thing to look at when it's recovery that's
blocked; my memory about this isn't great. I think the main clues to check
for recovery are the dlm kernel messages and maybe these sysfs files (where
foo is the lockspace name):
/sys/kernel/dlm/foo/recover_status
(flags may indicate which message is being waited for)
/sys/kernel/dlm/foo/recover_nodeid
(which node a reply is needed from)
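
A minimal sketch for collecting those two files on a node, in case it
helps to gather them from all six nodes at once. The function name
read_recovery_state and its base parameter are made up for illustration;
only the two sysfs paths come from the message above. The values are
printed raw, since how to decode the flags is not covered here.

```python
from pathlib import Path


def read_recovery_state(lockspace, base="/sys/kernel/dlm"):
    """Return the raw text of recover_status and recover_nodeid.

    `lockspace` is the directory name under /sys/kernel/dlm (for example
    "clvmd"). `base` is a hypothetical parameter that only exists so the
    function can be pointed at a copy of the files.
    """
    state = {}
    for name in ("recover_status", "recover_nodeid"):
        path = Path(base) / lockspace / name
        try:
            state[name] = path.read_text().strip()
        except OSError as exc:
            # Keep going so one missing file does not hide the other.
            state[name] = f"<unreadable: {exc.strerror}>"
    return state
```

Running print(read_recovery_state("clvmd")) on each node gives one line
per node that can be compared side by side.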
To eliminate userspace dlm_controld problems, look at dlm_controld debug
logs on each node and line up these steps from each of them:
clvmd check_ringid cluster 3724 (ringid needs to match)
clvmd start_kernel cg <N> member_count 6 (<N> will be different)
write "1" to "/sys/kernel/dlm/clvmd/control"
write "0" to "/sys/kernel/dlm/clvmd/event_done"
After this, follow the dlm kernel recovery messages, lining up the same
steps in parallel from each node. The point at which they stop is the
recovery stage where a message didn't get through. You can probably work
out which message between which nodes based on the sysfs files above.
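
The dlm_controld log comparison described above could be roughly
automated as below. This is only a sketch: the patterns are taken from the
four lines quoted in this message, and real debug logs may carry prefixes
or other variations that would need adjusting. The helper names
(step_patterns, last_steps, compare_nodes) are hypothetical, and using the
last occurrence of each step is a simplification that assumes the most
recent recovery cycle is the one of interest.

```python
import re


def step_patterns(lockspace):
    """Regexes for the four dlm_controld steps, in the order they occur."""
    ls = re.escape(lockspace)
    return [
        ("check_ringid", re.compile(rf"\b{ls} check_ringid cluster (\d+)")),
        ("start_kernel",
         re.compile(rf"\b{ls} start_kernel cg (\d+) member_count (\d+)")),
        ("control=1",
         re.compile(rf'write "1" to "/sys/kernel/dlm/{ls}/control"')),
        ("event_done=0",
         re.compile(rf'write "0" to "/sys/kernel/dlm/{ls}/event_done"')),
    ]


def last_steps(log_text, lockspace):
    """Map each step name to the captured groups of its last occurrence."""
    found = {}
    patterns = step_patterns(lockspace)
    for line in log_text.splitlines():
        for name, pattern in patterns:
            match = pattern.search(line)
            if match:
                found[name] = match.groups()
    return found


def compare_nodes(logs, lockspace):
    """logs maps node name -> debug log text. Returns a list of problems."""
    per_node = {node: last_steps(text, lockspace)
                for node, text in logs.items()}
    problems = []
    for node, steps in per_node.items():
        for name, _ in step_patterns(lockspace):
            if name not in steps:
                problems.append(f"{node}: missing {name}")
    # The ringid has to match on every node; the cg number may differ.
    ringids = {steps["check_ringid"][0]
               for steps in per_node.values() if "check_ringid" in steps}
    if len(ringids) > 1:
        problems.append(f"ringid mismatch: {sorted(ringids)}")
    return problems
```

An empty list from compare_nodes suggests userspace got through all four
steps on every node, so the kernel recovery messages are the next place
to look.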