[ClusterLabs] Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck

Mon Aug 28 12:07:28 EDT 2017

Hi,

In a 6-node cluster (vhbl03-08) the following happens 1-5 times a day
(in August; in May, it happened 0-2 times a day only, it's slowly
ramping up):

vhbl08 corosync[3687]:   [TOTEM ] A processor failed, forming new configuration.
vhbl03 corosync[3890]:   [TOTEM ] A processor failed, forming new configuration.
vhbl07 corosync[3805]:   [MAIN  ] Corosync main process was not scheduled for 4317.0054 ms (threshold is 2400.0000 ms). Consider token timeout increase.
vhbl07 corosync[3805]:   [TOTEM ] A processor failed, forming new configuration.
vhbl04 corosync[3759]:   [TOTEM ] A new membership (10.0.6.9:3056) was formed. Members
vhbl05 corosync[3919]:   [TOTEM ] A new membership (10.0.6.9:3056) was formed. Members
vhbl06 corosync[3759]:   [TOTEM ] A new membership (10.0.6.9:3056) was formed. Members
vhbl07 corosync[3805]:   [TOTEM ] A new membership (10.0.6.9:3056) was formed. Members
vhbl08 corosync[3687]:   [TOTEM ] A new membership (10.0.6.9:3056) was formed. Members
vhbl03 corosync[3890]:   [TOTEM ] A new membership (10.0.6.9:3056) was formed. Members
vhbl07 corosync[3805]:   [QUORUM] Members[6]: 167773705 167773706 167773707 167773708 167773709 167773710
vhbl08 corosync[3687]:   [QUORUM] Members[6]: 167773705 167773706 167773707 167773708 167773709 167773710
vhbl06 corosync[3759]:   [QUORUM] Members[6]: 167773705 167773706 167773707 167773708 167773709 167773710
vhbl07 corosync[3805]:   [MAIN  ] Completed service synchronization, ready to provide service.
vhbl04 corosync[3759]:   [QUORUM] Members[6]: 167773705 167773706 167773707 167773708 167773709 167773710
vhbl08 corosync[3687]:   [MAIN  ] Completed service synchronization, ready to provide service.
vhbl06 corosync[3759]:   [MAIN  ] Completed service synchronization, ready to provide service.
vhbl04 corosync[3759]:   [MAIN  ] Completed service synchronization, ready to provide service.
vhbl05 corosync[3919]:   [QUORUM] Members[6]: 167773705 167773706 167773707 167773708 167773709 167773710
vhbl03 corosync[3890]:   [QUORUM] Members[6]: 167773705 167773706 167773707 167773708 167773709 167773710
vhbl05 corosync[3919]:   [MAIN  ] Completed service synchronization, ready to provide service.
vhbl03 corosync[3890]:   [MAIN  ] Completed service synchronization, ready to provide service.

The cluster is running Corosync 2.4.2 multicast.  Those lines really end
at Members, there are no joined of left nodes listed.  Pacemaker on top
reacts like:

[9982] vhbl03 pacemakerd:     info: pcmk_quorum_notification:   Quorum retained | membership=3056 members=6
[9991] vhbl03       crmd:     info: pcmk_quorum_notification:   Quorum retained | membership=3056 members=6
[9986] vhbl03        cib:     info: cib_process_request:        Completed cib_modify operation for section nodes: OK (rc=0, origin=vhbl07/crmd/4477, version=0.1694.12)
[9986] vhbl03        cib:     info: cib_process_request:        Completed cib_modify operation for section status: OK (rc=0, origin=vhbl07/crmd/4478, version=0.1694.12)
[9986] vhbl03        cib:     info: cib_process_ping:   Reporting our current digest to vhbl07: 85250f3039d269f96012750f13e232d9 for 0.1694.12 (0x55ef057447d0 0)

on all nodes except for vhbl07, where it says:

[9886] vhbl07       crmd:     info: pcmk_quorum_notification:   Quorum retained | membership=3056 members=6
[9877] vhbl07 pacemakerd:     info: pcmk_quorum_notification:   Quorum retained | membership=3056 members=6
[9881] vhbl07        cib:     info: cib_process_request:        Forwarding cib_modify operation for section nodes to all (origin=local/crmd/
[9881] vhbl07        cib:     info: cib_process_request:        Forwarding cib_modify operation for section status to all (origin=local/crmd
[9881] vhbl07        cib:     info: cib_process_request:        Completed cib_modify operation for section nodes: OK (rc=0, origin=vhbl07/cr
[9881] vhbl07        cib:     info: cib_process_request:        Completed cib_modify operation for section status: OK (rc=0, origin=vhbl07/c
[9881] vhbl07        cib:     info: cib_process_ping:   Reporting our current digest to vhbl07: 85250f3039d269f96012750f13e232d9 for 0.1694.

So Pacemaker does nothing, basically, and I can't see any adverse effect
to resource management, but DLM seems to have some problem, which may or
may not be related.  When the TOTEM error appears, all nodes log this:

vhbl03 dlm_controld[3914]: 2801675 dlm:controld ring 167773705:3056 6 memb 167773705 167773706 167773707 167773708 167773709 167773710
vhbl03 dlm_controld[3914]: 2801675 fence work wait for cluster ringid
vhbl03 dlm_controld[3914]: 2801675 dlm:ls:clvmd ring 167773705:3056 6 memb 167773705 167773706 167773707 167773708 167773709 167773710
vhbl03 dlm_controld[3914]: 2801675 clvmd wait_messages cg 9 need 1 of 6
vhbl03 dlm_controld[3914]: 2801675 fence work wait for cluster ringid
vhbl03 dlm_controld[3914]: 2801675 cluster quorum 1 seq 3056 nodes 6

dlm_controld is running with --enable_fencing=0.  Pacemaker does its own
fencing if resource management requires it, but DLM is used by cLVM
only, which does not warrant such harsh measures.  Right now cLVM is
blocked; I don't know since when, because we seldom do cLVM operations
on this cluster.  My immediate aim is to unblock cLVM somehow.

While dlm_tool status reports (similar on all nodes):

cluster nodeid 167773705 quorate 1 ring seq 3088 3088
daemon now 2941405 fence_pid 0 
node 167773705 M add 196 rem 0 fail 0 fence 0 at 0 0
node 167773706 M add 5960 rem 5730 fail 0 fence 0 at 0 0
node 167773707 M add 2089 rem 1802 fail 0 fence 0 at 0 0
node 167773708 M add 3646 rem 3413 fail 0 fence 0 at 0 0
node 167773709 M add 2588921 rem 2588920 fail 0 fence 0 at 0 0
node 167773710 M add 196 rem 0 fail 0 fence 0 at 0 0

dlm_tool ls shows "kern_stop":

dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x00000004 kern_stop
change        member 5 joined 0 remove 1 failed 1 seq 8,8
members       167773705 167773706 167773707 167773708 167773710 
new change    member 6 joined 1 remove 0 failed 0 seq 9,9
new status    wait messages 1
new members   167773705 167773706 167773707 167773708 167773709 167773710 

on all nodes except for vhbl07 (167773709), where it gives

dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x00000000 
change        member 6 joined 1 remove 0 failed 0 seq 11,11
members       167773705 167773706 167773707 167773708 167773709 167773710 

instead.

Does anybody have an idea what the problem(s) might be?  Why is Corosync
deteriorating on this cluster?  (It's running with RR PRIO 99.)  Could
that have hurt DLM?  Is there a way to unblock DLM without rebooting all
nodes?  (Actually, rebooting is problematic in itself with blocked cLVM,
but that's tractable.)
-- 
Thanks,
Feri