[ClusterLabs] dlm_controld does not recover from failed lockspace join

Tue Jan 8 11:48:19 EST 2019

Hello,

We've seen an issue in production where DLM 4.0.7 gets "stuck" and is
unable to join more lockspaces. Other nodes in the cluster were still able
to join new lockspaces, but not the one that node 1 was stuck on.
GFS2 was unaffected (the "stuck" lockspace was for a userspace control
daemon, but that's just luck; it could have been GFS2's lockspace).

I do not have repro steps for this yet, but analyzing the kernel and
dlm_controld logs I think I found the root cause:

dlm_controld[7104]: 10998 fence work wait for quorum
dlm_controld[7104]: 11000 xapi-clusterd-lockspace wait for quorum
dlm_controld[7104]: 14602 fence work wait for quorum
dlm_controld[7104]: 14604 xapi-clusterd-lockspace wait for quorum
[15419.173125] dlm: xapi-clusterd-lockspace: group event done -512 0
[15419.173135] dlm: xapi-clusterd-lockspace: group join failed -512 0
dlm_controld[7104]: 15366 process_uevent online@ error -17 errno 0
...
[16080.892629] dlm: xapi-clusterd-lockspace: group event done -512 0
[16080.892638] dlm: xapi-clusterd-lockspace: group join failed -512 0
[16080.893156] dlm: cannot start dlm_scand thread -4
dlm_controld[7104]: 16087 cdab491e-14c8-ab wait for quorum
...
dlm_controld[7104]: 18199 fence work wait for quorum
dlm_controld[7104]: 18201 xapi-clusterd-lockspace wait for quorum
[19551.164358] dlm: xapi-clusterd-lockspace: joining the lockspace group...
dlm_controld[7104]: 19320 open "/sys/kernel/dlm/xapi-clusterd-lockspace/id" error -1 2
dlm_controld[7104]: 19320 open "/sys/kernel/dlm/xapi-clusterd-lockspace/control" error -1 2
dlm_controld[7104]: 19320 open "/sys/kernel/dlm/xapi-clusterd-lockspace/event_done" error -1 2
dlm_controld[7104]: 19321 open "/sys/kernel/dlm/xapi-clusterd-lockspace/control" error -1 2
dlm_controld[7104]: 19321 open "/sys/kernel/dlm/xapi-clusterd-lockspace/control" error -1 2
dlm_controld[7104]: 19495 process_uevent online@ error -17 errno 2
...
[19551.455848] dlm: invalid lockspace 2844031955 from 2 cmd 2 type 1
[19552.459852] dlm: invalid lockspace 2844031955 from 2 cmd 2 type 1

And on another host from the cluster:
[41373.794149] dlm: xapi-clusterd-lockspace: remote node 1 not ready

Error 512 is ERESTARTSYS in the kernel.
Error 17 is EEXIST, and looking through the source code it appears to be
returned here in main.c:
if (!strcmp(act, "online@")) {
        ls = find_ls(argv[3]);
        if (ls) {
                rv = -EEXIST;
                goto out;
        }

find_ls() searches the global lockspaces list, which AFAICT is only
ever added to, but never removed from:
dlm_controld/cpg.c:     list_for_each_entry(ls, &lockspaces, list) {
dlm_controld/cpg.c:     list_for_each_entry(ls, &lockspaces, list) {
dlm_controld/cpg.c:     list_for_each_entry_safe(ls, safe, &lockspaces, list) {
dlm_controld/cpg.c:     list_add(&ls->list, &lockspaces);
dlm_controld/cpg.c:     list_for_each_entry(ls, &lockspaces, list)
dlm_controld/cpg.c:     list_for_each_entry(ls, &lockspaces, list) {
dlm_controld/daemon_cpg.c:      list_for_each_entry(ls, &lockspaces, list) {
dlm_controld/dlm_daemon.h:EXTERN struct list_head lockspaces;
dlm_controld/main.c:    list_for_each_entry(ls, &lockspaces, list) {
dlm_controld/main.c:    list_for_each_entry(ls, &lockspaces, list) {
dlm_controld/main.c:                    if (daemon_quit && list_empty(&lockspaces)) {
dlm_controld/main.c:    list_for_each_entry(ls, &lockspaces, list)
dlm_controld/member.c:          if (list_empty(&lockspaces)) {
dlm_controld/plock.c:   list_for_each_entry(ls, &lockspaces, list) {

So if joining the lockspace fails, dlm_controld would forever refuse to
join that lockspace (until the host is rebooted): its list of lockspaces
is now out of sync with the kernel's.
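
To make the suspected failure mode concrete, here is a toy model of it. This
is not dlm_controld code: toy_ls, toy_find_ls(), toy_online() and the
cleanup_on_failure switch are all hypothetical names made up for
illustration, and TOY_ERESTARTSYS only mirrors the kernel-internal value 512
seen in the logs.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Kernel-internal ERESTARTSYS is 512; it is not in userspace <errno.h>. */
#define TOY_ERESTARTSYS 512

/* Hypothetical stand-in for the daemon's per-lockspace bookkeeping. */
struct toy_ls {
    char name[64];
    struct toy_ls *next;
};

/* Models the global "lockspaces" list. */
struct toy_ls *toy_lockspaces;

struct toy_ls *toy_find_ls(const char *name)
{
    struct toy_ls *ls;

    for (ls = toy_lockspaces; ls; ls = ls->next)
        if (!strcmp(ls->name, name))
            return ls;
    return NULL;
}

/*
 * Models handling of an "online@" uevent.
 * kernel_join_ok:     simulates whether the kernel-side join succeeds.
 * cleanup_on_failure: 0 models the suspected current behaviour (the entry
 *                     stays on the list), 1 models removing it again.
 */
int toy_online(const char *name, int kernel_join_ok, int cleanup_on_failure)
{
    struct toy_ls *ls;

    assert(name != NULL);

    if (toy_find_ls(name))
        return -EEXIST;

    ls = calloc(1, sizeof(*ls));
    if (!ls)
        return -ENOMEM;
    strncpy(ls->name, name, sizeof(ls->name) - 1);
    ls->next = toy_lockspaces;
    toy_lockspaces = ls;

    if (kernel_join_ok)
        return 0;

    if (cleanup_on_failure) {
        /* ls was just pushed, so it is still the list head. */
        toy_lockspaces = ls->next;
        free(ls);
    }
    return -TOY_ERESTARTSYS;
}
```

With cleanup_on_failure set to 0, a failed toy_online("x", 0, 0) leaves "x"
on the list, so every later attempt for "x" returns -EEXIST even if the
kernel would now accept the join. With it set to 1, a retry can succeed.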

How should dlm_controld recover from such an error? Could it refresh its
list of lockspaces from the kernel if a join failed?
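
As a rough sketch of the kind of check I have in mind (an assumption on my
part, not something dlm_controld does today): since the daemon already opens
files under /sys/kernel/dlm/<name>/, it could test whether that directory
still exists before trusting its own list. kernel_has_lockspace() is a
hypothetical helper name.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Returns 1 if <sysfs_root>/<name> exists and is a directory,
 * 0 if it does not, and -1 if the path does not fit the buffer.
 */
int kernel_has_lockspace(const char *sysfs_root, const char *name)
{
    char path[4096];
    struct stat st;
    int n;

    assert(sysfs_root != NULL && name != NULL);

    n = snprintf(path, sizeof(path), "%s/%s", sysfs_root, name);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    if (stat(path, &st) != 0)
        return 0;

    return S_ISDIR(st.st_mode) ? 1 : 0;
}
```

An entry on the daemon's list for which
kernel_has_lockspace("/sys/kernel/dlm", ls->name) returns 0 would then be a
candidate for removal, though I don't know what other state would need to be
torn down along with it.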

Best regards,
--Edwin