[ClusterLabs] DLM recovery stuck

Thu Aug 9 19:10:02 UTC 2018

Hi Feri, the rule of thumb is to use a separate, dedicated network for corosync traffic. For example, we use two corosync rings: the first, active ring runs on a separate network card and switch, and the second, passive ring runs on a VLAN of a team (bond) device.
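
As a rough illustration of such a two-ring layout, a totem section along the following lines could be used. This is only a sketch assuming corosync 2.x with the redundant ring protocol (rrp_mode); the network addresses, multicast addresses and ports are placeholders, not values from any real cluster, and a complete corosync.conf needs more than this one section.

```
# Sketch only: all addresses and ports below are hypothetical placeholders.
totem {
    version: 2

    # Passive RRP: one ring carries the traffic, the other is used
    # when the first is marked faulty.
    rrp_mode: passive

    # Ring 0: dedicated network card and switch (assumed subnet).
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.10.0
        mcastaddr: 239.255.1.1
        mcastport: 5405
    }

    # Ring 1: VLAN on the team (bond) device (assumed subnet).
    interface {
        ringnumber: 1
        bindnetaddr: 192.168.20.0
        mcastaddr: 239.255.2.1
        mcastport: 5407
    }
}
```

With a setup like this, the ring status on a node can be inspected with corosync-cfgtool -s.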

Best regards, Kristián Feldsam

> On 9 Aug 2018, at 20:17, Ferenc Wágner <wferi at niif.hu> wrote:
> 
> David Teigland <teigland at redhat.com> writes:
> 
>> On Thu, Aug 09, 2018 at 06:11:48PM +0200, Ferenc Wágner wrote:
>> 
>>> Almost ten years ago you requested more info in a similar case, let's
>>> see if we can get further now!
>> 
>> Hi, the usual cause is that a network message from the dlm has been
>> lost/dropped/missed.  The dlm can't recover from that, which is clearly a
>> weak point in the design.  There may be some new development coming along
>> to finally improve that.
> 
> Hi David,
> 
> Good to hear!  Can you share any more info about this development?
> 
>> One way you can confirm this is to check if the dlm on one or more nodes
>> is waiting for a message that's not arriving.  Often you'll see an entry
>> in the dlm "waiters" debugfs file corresponding to a response that's being
>> waited on.
> 
> If you mean dlm/clvmd_waiters, it's empty on all nodes.  Is there
> anything else to check?
> 
>> Another red flag is kernel messages from a driver indicating some network
>> hiccup at the time things hung.  I can't say if these messages you sent
>> happened at the right time, or if they even correspond to the dlm
>> interface, but it's worth checking as a possible explanation:
>> 
>> [  137.207059] be2net 0000:05:00.0 enp5s0f0: Link is Up
>> [  137.252901] be2net 0000:05:00.1 enp5s0f1: Link is Up
> 
> Hard to say...  This is an iSCSI offload card with two physical ports,
> which are each virtualized in the card into 4 logical ports; 3 per physical
> port are passed to the OS as separate PCI functions, while the other two are
> used for iSCSI traffic.  The DLM traffic goes through a Linux bond made
> of enp5s0f4 and enp5s0f5, which is started at 112.393798 and used for
> Corosync traffic first.  The above two lines are signs of OpenVSwitch
> starting up for independent purposes.  It should be totally independent,
> but it's the same device after all, so I can't exclude all possibility
> of "crosstalk".
> 
>> [  153.886619]  connection2:0: detected conn error (1011)
> 
> See above: iSCSI traffic is offloaded, not visible on the OS level, and
> these connection failures are expected at the moment because some of the
> targets are inaccessible.  *But* it uses the same wire in the end, just
> different VLANs, and the virtualization (in the card itself) may not
> provide absolutely perfect separation.
> -- 
> Thanks,
> Feri
