[ClusterLabs] Antw: Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Tue Aug 29 06:49:59 UTC 2017


>>> Ferenc Wágner <wferi at niif.hu> wrote on 28.08.2017 at 18:07 in message
<87mv6jk75r.fsf at lant.ki.iif.hu>:

[...]
cLVM under I/O load can be really slow (I'm talking about delays in the range
of a few seconds), so be sure to adjust any timeouts accordingly. I wrote a
tool that monitors read latency as seen by applications, which is how I know
these numbers. And things get significantly worse if you do cLVM mirroring
with a mirrorlog replicated to each device.
Maybe cLVM slows down as n^2, where n is the number of nodes; I don't know
;-)
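
To give an idea of the kind of measurement I mean, here is a minimal sketch
(not my actual tool; the device path, block size and sample count are just
placeholders):

#!/usr/bin/env python3
# Minimal sketch of timing application-visible read latency on a device or
# file.  Not the actual tool; path and parameters are placeholders.
import os
import sys
import time

DEVICE = sys.argv[1] if len(sys.argv) > 1 else "/dev/dm-0"   # placeholder
BLOCK = 4096        # bytes read per sample
SAMPLES = 100       # number of timed reads

fd = os.open(DEVICE, os.O_RDONLY)
latencies = []
for _ in range(SAMPLES):
    # Drop the cached copy of this range so the read actually hits storage.
    os.posix_fadvise(fd, 0, BLOCK, os.POSIX_FADV_DONTNEED)
    t0 = time.monotonic()
    os.pread(fd, BLOCK, 0)
    latencies.append(time.monotonic() - t0)
    time.sleep(0.1)
os.close(fd)

latencies.sort()
print("median %.3f ms, max %.3f ms" %
      (latencies[len(latencies) // 2] * 1000, latencies[-1] * 1000))

If the reported latencies ever approach your cluster timeouts, the timeouts
are too tight.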

Regards,
Ulrich

> So Pacemaker does nothing, basically, and I can't see any adverse effect
> on resource management, but DLM seems to have some problem, which may or
> may not be related.  When the TOTEM error appears, all nodes log this:
> 
> vhbl03 dlm_controld[3914]: 2801675 dlm:controld ring 167773705:3056 6 memb 
> 167773705 167773706 167773707 167773708 167773709 167773710
> vhbl03 dlm_controld[3914]: 2801675 fence work wait for cluster ringid
> vhbl03 dlm_controld[3914]: 2801675 dlm:ls:clvmd ring 167773705:3056 6 memb 
> 167773705 167773706 167773707 167773708 167773709 167773710
> vhbl03 dlm_controld[3914]: 2801675 clvmd wait_messages cg 9 need 1 of 6
> vhbl03 dlm_controld[3914]: 2801675 fence work wait for cluster ringid
> vhbl03 dlm_controld[3914]: 2801675 cluster quorum 1 seq 3056 nodes 6
> 
> dlm_controld is running with --enable_fencing=0.  Pacemaker does its own
> fencing if resource management requires it, but DLM is used by cLVM
> only, which does not warrant such harsh measures.  Right now cLVM is
> blocked; I don't know since when, because we seldom do cLVM operations
> on this cluster.  My immediate aim is to unblock cLVM somehow.
> 
> While dlm_tool status reports (similar on all nodes):
> 
> cluster nodeid 167773705 quorate 1 ring seq 3088 3088
> daemon now 2941405 fence_pid 0 
> node 167773705 M add 196 rem 0 fail 0 fence 0 at 0 0
> node 167773706 M add 5960 rem 5730 fail 0 fence 0 at 0 0
> node 167773707 M add 2089 rem 1802 fail 0 fence 0 at 0 0
> node 167773708 M add 3646 rem 3413 fail 0 fence 0 at 0 0
> node 167773709 M add 2588921 rem 2588920 fail 0 fence 0 at 0 0
> node 167773710 M add 196 rem 0 fail 0 fence 0 at 0 0
> 
> dlm_tool ls shows "kern_stop":
> 
> dlm lockspaces
> name          clvmd
> id            0x4104eefa
> flags         0x00000004 kern_stop
> change        member 5 joined 0 remove 1 failed 1 seq 8,8
> members       167773705 167773706 167773707 167773708 167773710 
> new change    member 6 joined 1 remove 0 failed 0 seq 9,9
> new status    wait messages 1
> new members   167773705 167773706 167773707 167773708 167773709 167773710 
> 
> on all nodes except for vhbl07 (167773709), where it gives
> 
> dlm lockspaces
> name          clvmd
> id            0x4104eefa
> flags         0x00000000 
> change        member 6 joined 1 remove 0 failed 0 seq 11,11
> members       167773705 167773706 167773707 167773708 167773709 167773710 
> 
> instead.
> 
> Does anybody have an idea what the problem(s) might be?  Why is Corosync
> deteriorating on this cluster?  (It's running with RR PRIO 99.)  Could
> that have hurt DLM?  Is there a way to unblock DLM without rebooting all
> nodes?  (Actually, rebooting is problematic in itself with blocked cLVM,
> but that's tractable.)
> -- 
> Thanks,
> Feri
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org 
> http://lists.clusterlabs.org/mailman/listinfo/users 
> 
> Project Home: http://www.clusterlabs.org 
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> Bugs: http://bugs.clusterlabs.org 






