[ClusterLabs] corosync 3.0.1 on Debian/Buster reports some MTU errors

Thu Nov 21 10:56:46 EST 2019

Hi,

* christine caulfield <ccaulfie at redhat.com> [20191121 03:19]:
> On 18/11/2019 21:31, Jean-Francois Malouin wrote:
> > Hi,
> > 
> > This may not be strictly a Pacemaker question, but perhaps some of you have
> > seen this problem:
> > 
> > A two-node Pacemaker cluster running corosync-3.0.1 with dual communication rings
> > sometimes reports errors like this in the corosync log file:
> > 
> > [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 470 to 1366
> > [KNET  ] pmtud: PMTUD link change for host: 2 link: 1 from 470 to 1366
> > [KNET  ] pmtud: Global data MTU changed to: 1366
> > [CFG   ] Modified entry 'totem.netmtu' in corosync.conf cannot be changed at run-time
> > [CFG   ] Modified entry 'totem.netmtu' in corosync.conf cannot be changed at run-time
> > 
> > Those do not happen very frequently, about once a week.
> > 
> 
> Those messages are caused by a config file reload (corosync-cfgtool -R)
> being triggered by something. If they're happening once a week then check
> your cron jobs.

No cron job is involved here, but the messages may well originate from my own
actions, after a reload, as you suggest.

> > However, the system log on the nodes reports the following much more
> > frequently, a few times a day:
> > 
> > Nov 17 23:26:20 node1 corosync[2258]:   [KNET  ] link: host: 2 link: 1 is down
> > Nov 17 23:26:20 node1 corosync[2258]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 0)
> > Nov 17 23:26:26 node1 corosync[2258]:   [KNET  ] rx: host: 2 link: 1 is up
> > Nov 17 23:26:26 node1 corosync[2258]:   [KNET  ] host: host: 2 (passive) best link: 1 (pri: 1)
> > 
> 
> Those don't look good. Having a link down for 6 seconds looks like a serious
> network outage that needs looking into, especially if they are that
> frequent, or it could be a bug. You don't say which version of libknet you
> have installed but make sure it's the latest one.

libknet1 is 1.8-2, which is the latest version in the Debian buster distribution.
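
To put numbers on the link flaps quoted above, here is a minimal Python sketch
that pairs each "is down" message with the next "is up" message for the same
host and link and reports how long the link was down. It assumes the syslog
line layout shown in the excerpt. The function name link_outages and the year
argument are my own (syslog timestamps carry no year); nothing here is
provided by corosync or libknet.

```python
import re
from datetime import datetime

# Matches the two knet syslog messages quoted in this thread, for example:
#   Nov 17 23:26:20 node1 corosync[2258]:   [KNET  ] link: host: 2 link: 1 is down
#   Nov 17 23:26:26 node1 corosync[2258]:   [KNET  ] rx: host: 2 link: 1 is up
LINK_EVENT = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+corosync\[\d+\]:\s+"
    r"\[KNET\s*\]\s+(?:link|rx):\s+host:\s+(?P<host>\d+)\s+"
    r"link:\s+(?P<link>\d+)\s+is\s+(?P<state>up|down)"
)


def link_outages(lines, year=2019):
    """Return (host, link, down_at, seconds_down) for each down/up pair.

    `year` is an assumption supplied by the caller, because classic syslog
    timestamps do not include one.
    """
    down_since = {}  # (host, link) -> datetime of the first unmatched "is down"
    outages = []
    for line in lines:
        m = LINK_EVENT.match(line)
        if m is None:
            continue  # not a link up/down message
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        key = (int(m["host"]), int(m["link"]))
        if m["state"] == "down":
            down_since.setdefault(key, ts)
        elif key in down_since:
            start = down_since.pop(key)
            outages.append((key[0], key[1], start, (ts - start).total_seconds()))
    return outages
```

Feeding it the lines of the system log (for example from an open file object)
gives one tuple per outage, which makes it easy to see whether both links
tend to drop at the same moment.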

> The fencing event in your other message was caused because both links were
> down at the same time, which is a worrying coincidence. Changing the token
> timeout won't make any difference to the knet link events, but if the knet
> links are down for long enough then that will trigger a token timeout and a
> fence event.
> 
> Definitely look for something odd in your networking - the corosync.conf
> file looks sane (though having knet_transport in the top-level totem stanza
> is doing nothing), so it's not that.
> 
> It's hard to make a judgement with just that info, but look for dropped
> packets on the interfaces, slow response to other network services or very
> high load on one of the nodes. If you can't see anything on the systems then
> enable debug logging and get back to us. If it is a bug we want it fixed!

Since that network outage, no errors have crept into the corosync logs (I have
turned debug logging on). As you suggest, I suspect a hardware issue, either
at the NIC level or in the cabling. I do notice quite a few dropped packets on
one of the links...
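
For keeping an eye on those drops, here is a minimal Python sketch that reads
the per-interface counters Linux exposes under
/sys/class/net/<iface>/statistics and computes the change between two
readings. The function names and the interface name in the usage comment are
hypothetical; substitute whatever NICs carry ring0 and ring1 on your nodes.

```python
from pathlib import Path

# Counter files found under /sys/class/net/<iface>/statistics on Linux.
COUNTERS = ("rx_dropped", "tx_dropped", "rx_errors", "tx_errors")


def read_drop_counters(iface, sysfs_root="/sys/class/net"):
    """Return the current drop/error counters for one interface as a dict."""
    stats = Path(sysfs_root) / iface / "statistics"
    return {name: int((stats / name).read_text().strip()) for name in COUNTERS}


def counter_deltas(before, after):
    """Return how much each counter grew between two readings."""
    return {name: after[name] - before[name] for name in before}


# Usage sketch ("eth1" is a placeholder interface name):
#   before = read_drop_counters("eth1")
#   ... wait a while, or until the next knet "link down" message ...
#   after = read_drop_counters("eth1")
#   print(counter_deltas(before, after))
```

A counter that keeps growing on only one of the two interfaces would point at
that NIC or its cable rather than at corosync itself.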

Thanks for the reply,
jf

> 
> Chrissie
> 
> 
> > Are those to be dismissed, or do they indicate a network misconfiguration or problem?
> > I tried setting 'knet_transport: udpu' in the totem section (the default value),
> > but it didn't seem to make a difference. Hard-coding netmtu to 1500 and
> > allowing a longer (10s) token timeout also didn't seem to affect the issue.
> > 
> > 
> > Corosync config follows:
> > 
> > /etc/corosync/corosync.conf
> > 
> > totem {
> >      version: 2
> >      cluster_name: bicha
> >      transport: knet
> >      link_mode: passive
> >      ip_version: ipv4
> >      token: 10000
> >      netmtu: 1500
> >      knet_transport: sctp
> >      crypto_model: openssl
> >      crypto_hash: sha256
> >      crypto_cipher: aes256
> >      keyfile: /etc/corosync/authkey
> >      interface {
> >          linknumber: 0
> >          knet_transport: udp
> >          knet_link_priority: 0
> >      }
> >      interface {
> >          linknumber: 1
> >          knet_transport: udp
> >          knet_link_priority: 1
> >      }
> > }
> > quorum {
> >      provider: corosync_votequorum
> >      two_node: 1
> > #    expected_votes: 2
> > }
> > nodelist {
> >      node {
> >          ring0_addr: xxx.xxx.xxx.xxx
> >          ring1_addr: zzz.zzz.zzz.zzx
> >          name: node1
> >          nodeid: 1
> >      }
> >      node {
> >          ring0_addr: xxx.xxx.xxx.xxy
> >          ring1_addr: zzz.zzz.zzz.zzy
> >          name: node2
> >          nodeid: 2
> >      }
> > }
> > logging {
> >      to_logfile: yes
> >      to_syslog: yes
> >      logfile: /var/log/corosync/corosync.log
> >      syslog_facility: daemon
> >      debug: off
> >      timestamp: on
> >      logger_subsys {
> >          subsys: QUORUM
> >          debug: off
> >      }
> > }
> > _______________________________________________
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> > 
> > ClusterLabs home: https://www.clusterlabs.org/
> > 