[ClusterLabs] Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck

Digimer lists at alteeve.ca
Mon Aug 28 14:20:56 EDT 2017


On 2017-08-28 12:07 PM, Ferenc Wágner wrote:
> Hi,
> 
> In a 6-node cluster (vhbl03-08), the following happens 1-5 times a day
> (in August; in May it happened only 0-2 times a day, so it's slowly
> ramping up):
> 
> vhbl08 corosync[3687]:   [TOTEM ] A processor failed, forming new configuration.
> vhbl03 corosync[3890]:   [TOTEM ] A processor failed, forming new configuration.
> vhbl07 corosync[3805]:   [MAIN  ] Corosync main process was not scheduled for 4317.0054 ms (threshold is 2400.0000 ms). Consider token timeout increase.
> vhbl07 corosync[3805]:   [TOTEM ] A processor failed, forming new configuration.
> vhbl04 corosync[3759]:   [TOTEM ] A new membership (10.0.6.9:3056) was formed. Members
> vhbl05 corosync[3919]:   [TOTEM ] A new membership (10.0.6.9:3056) was formed. Members
> vhbl06 corosync[3759]:   [TOTEM ] A new membership (10.0.6.9:3056) was formed. Members
> vhbl07 corosync[3805]:   [TOTEM ] A new membership (10.0.6.9:3056) was formed. Members
> vhbl08 corosync[3687]:   [TOTEM ] A new membership (10.0.6.9:3056) was formed. Members
> vhbl03 corosync[3890]:   [TOTEM ] A new membership (10.0.6.9:3056) was formed. Members
> vhbl07 corosync[3805]:   [QUORUM] Members[6]: 167773705 167773706 167773707 167773708 167773709 167773710
> vhbl08 corosync[3687]:   [QUORUM] Members[6]: 167773705 167773706 167773707 167773708 167773709 167773710
> vhbl06 corosync[3759]:   [QUORUM] Members[6]: 167773705 167773706 167773707 167773708 167773709 167773710
> vhbl07 corosync[3805]:   [MAIN  ] Completed service synchronization, ready to provide service.
> vhbl04 corosync[3759]:   [QUORUM] Members[6]: 167773705 167773706 167773707 167773708 167773709 167773710
> vhbl08 corosync[3687]:   [MAIN  ] Completed service synchronization, ready to provide service.
> vhbl06 corosync[3759]:   [MAIN  ] Completed service synchronization, ready to provide service.
> vhbl04 corosync[3759]:   [MAIN  ] Completed service synchronization, ready to provide service.
> vhbl05 corosync[3919]:   [QUORUM] Members[6]: 167773705 167773706 167773707 167773708 167773709 167773710
> vhbl03 corosync[3890]:   [QUORUM] Members[6]: 167773705 167773706 167773707 167773708 167773709 167773710
> vhbl05 corosync[3919]:   [MAIN  ] Completed service synchronization, ready to provide service.
> vhbl03 corosync[3890]:   [MAIN  ] Completed service synchronization, ready to provide service.
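
That "was not scheduled for 4317.0054 ms" warning means the corosync
process was starved of CPU for longer than its pause-detection
threshold; the 2400 ms it reports looks like 80% of a 3000 ms token.
Raising the token timeout is only a stopgap (the real question is why
corosync isn't being scheduled for 4+ seconds at a time), but as a
sketch, assuming an otherwise unchanged totem section in corosync.conf:

  totem {
          version: 2
          # raise from the apparent 3000 ms so a 4-5 s scheduling stall
          # no longer looks like a failed processor
          token: 6000
          # keep your existing interface and other settings as they are
  }

After editing, corosync-cfgtool -R should push the new value to all
nodes if your 2.4.2 build supports runtime reload; otherwise restart
corosync one node at a time.
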
> 
> The cluster is running Corosync 2.4.2 with multicast.  Those lines
> really end at "Members"; no joined or left nodes are listed.  Pacemaker
> on top reacts like this:
> 
> [9982] vhbl03 pacemakerd:     info: pcmk_quorum_notification:   Quorum retained | membership=3056 members=6
> [9991] vhbl03       crmd:     info: pcmk_quorum_notification:   Quorum retained | membership=3056 members=6
> [9986] vhbl03        cib:     info: cib_process_request:        Completed cib_modify operation for section nodes: OK (rc=0, origin=vhbl07/crmd/4477, version=0.1694.12)
> [9986] vhbl03        cib:     info: cib_process_request:        Completed cib_modify operation for section status: OK (rc=0, origin=vhbl07/crmd/4478, version=0.1694.12)
> [9986] vhbl03        cib:     info: cib_process_ping:   Reporting our current digest to vhbl07: 85250f3039d269f96012750f13e232d9 for 0.1694.12 (0x55ef057447d0 0)
> 
> on all nodes except for vhbl07, where it says:
> 
> [9886] vhbl07       crmd:     info: pcmk_quorum_notification:   Quorum retained | membership=3056 members=6
> [9877] vhbl07 pacemakerd:     info: pcmk_quorum_notification:   Quorum retained | membership=3056 members=6
> [9881] vhbl07        cib:     info: cib_process_request:        Forwarding cib_modify operation for section nodes to all (origin=local/crmd/
> [9881] vhbl07        cib:     info: cib_process_request:        Forwarding cib_modify operation for section status to all (origin=local/crmd
> [9881] vhbl07        cib:     info: cib_process_request:        Completed cib_modify operation for section nodes: OK (rc=0, origin=vhbl07/cr
> [9881] vhbl07        cib:     info: cib_process_request:        Completed cib_modify operation for section status: OK (rc=0, origin=vhbl07/c
> [9881] vhbl07        cib:     info: cib_process_ping:   Reporting our current digest to vhbl07: 85250f3039d269f96012750f13e232d9 for 0.1694.
> 
> So Pacemaker basically does nothing, and I can't see any adverse effect
> on resource management, but DLM seems to have some problem, which may or
> may not be related.  When the TOTEM error appears, all nodes log this:
> 
> vhbl03 dlm_controld[3914]: 2801675 dlm:controld ring 167773705:3056 6 memb 167773705 167773706 167773707 167773708 167773709 167773710
> vhbl03 dlm_controld[3914]: 2801675 fence work wait for cluster ringid
> vhbl03 dlm_controld[3914]: 2801675 dlm:ls:clvmd ring 167773705:3056 6 memb 167773705 167773706 167773707 167773708 167773709 167773710
> vhbl03 dlm_controld[3914]: 2801675 clvmd wait_messages cg 9 need 1 of 6
> vhbl03 dlm_controld[3914]: 2801675 fence work wait for cluster ringid
> vhbl03 dlm_controld[3914]: 2801675 cluster quorum 1 seq 3056 nodes 6
> 
> dlm_controld is running with --enable_fencing=0.  Pacemaker does its own
> fencing if resource management requires it, but DLM is used by cLVM
> only, which does not warrant such harsh measures.  Right now cLVM is
> blocked; I don't know since when, because we seldom do cLVM operations
> on this cluster.  My immediate aim is to unblock cLVM somehow.
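
For what it's worth, dlm_controld can delegate fencing to Pacemaker
instead of having it disabled outright.  A minimal /etc/dlm/dlm.conf
sketch (assuming your dlm package ships the dlm_stonith proxy agent and
that these option names match your dlm_controld version):

  # let dlm_controld request and wait for fencing again
  enable_fencing=1
  # hand fencing requests to Pacemaker's stonith via the helper
  # (adjust the path to wherever your package installs it)
  fence_all /usr/sbin/dlm_stonith

The idea is that dlm_controld then gets its "node X was fenced"
confirmation from Pacemaker rather than having no fencing information
at all.
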
> 
> While dlm_tool status reports (similar on all nodes):
> 
> cluster nodeid 167773705 quorate 1 ring seq 3088 3088
> daemon now 2941405 fence_pid 0 
> node 167773705 M add 196 rem 0 fail 0 fence 0 at 0 0
> node 167773706 M add 5960 rem 5730 fail 0 fence 0 at 0 0
> node 167773707 M add 2089 rem 1802 fail 0 fence 0 at 0 0
> node 167773708 M add 3646 rem 3413 fail 0 fence 0 at 0 0
> node 167773709 M add 2588921 rem 2588920 fail 0 fence 0 at 0 0
> node 167773710 M add 196 rem 0 fail 0 fence 0 at 0 0
> 
> dlm_tool ls shows "kern_stop":
> 
> dlm lockspaces
> name          clvmd
> id            0x4104eefa
> flags         0x00000004 kern_stop
> change        member 5 joined 0 remove 1 failed 1 seq 8,8
> members       167773705 167773706 167773707 167773708 167773710 
> new change    member 6 joined 1 remove 0 failed 0 seq 9,9
> new status    wait messages 1
> new members   167773705 167773706 167773707 167773708 167773709 167773710 
> 
> on all nodes except for vhbl07 (167773709), where it gives
> 
> dlm lockspaces
> name          clvmd
> id            0x4104eefa
> flags         0x00000000 
> change        member 6 joined 1 remove 0 failed 0 seq 11,11
> members       167773705 167773706 167773707 167773708 167773709 167773710 
> 
> instead.
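
(To see exactly what the stuck lockspace is waiting for, it may help to
capture "dlm_tool dump", and "dlm_tool lockdebug clvmd" if your build
has it, on one of the blocked nodes and on vhbl07, and compare them
against the change/seq numbers above.)
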
> 
> Does anybody have an idea what the problem(s) might be?  Why is Corosync
> deteriorating on this cluster?  (It's running with RR PRIO 99.)  Could
> that have hurt DLM?  Is there a way to unblock DLM without rebooting all
> nodes?  (Actually, rebooting is problematic in itself with blocked cLVM,
> but that's tractable.)
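
On unblocking without a reboot: if dlm_controld is in fact waiting for a
fencing confirmation for the member it lost (here that would be vhbl07,
167773709, the node in the "failed 1" change), then manually
acknowledging the fence is the usual way out, assuming your dlm_tool has
the fence_ack command (see the dlm_tool man page):

  # only after verifying that the node in question really was reset,
  # or is otherwise known not to hold any DLM locks
  dlm_tool fence_ack 167773709

Be careful with it: acknowledging a fence for a node that is alive and
holding locks is a good way to corrupt shared storage.  And since
dlm_tool status shows fence_pid 0 everywhere, the lockspace may instead
be stuck purely on the "wait messages 1" exchange, in which case this
won't help.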

Looks like the lost node wasn't fenced. Do you have fencing configured
and tested? If not, DLM will block forever, by design: it won't recover
until it has been told that the lost peer has been fenced.
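
For reference, a fence device in Pacemaker for one of these nodes would
look something like the sketch below.  All values are hypothetical
(agent, address, credentials, and even the parameter names depend on
your fence-agents version), and crmsh has an equivalent
stonith:fence_ipmilan primitive if you don't use pcs:

  pcs stonith create fence-vhbl07 fence_ipmilan \
      ipaddr=vhbl07-ipmi.example.net login=admin passwd=secret \
      lanplus=1 pcmk_host_list=vhbl07
  pcs property set stonith-enabled=true

Once Pacemaker can really fence a node, dlm_controld can be told about
it (for example through the dlm_stonith helper mentioned above), and
DLM recovery no longer hangs on a peer it can't account for.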


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould



