[ClusterLabs] corosync 3.0.1 on Debian/Buster reports some MTU errors
Jean-Francois Malouin
Jean-Francois.Malouin at bic.mni.mcgill.ca
Thu Nov 21 10:56:46 EST 2019
Hi,
* christine caulfield <ccaulfie at redhat.com> [20191121 03:19]:
> On 18/11/2019 21:31, Jean-Francois Malouin wrote:
> > Hi,
> >
> > Maybe not directly a pacemaker question but maybe some of you have seen this
> > problem:
> >
> > A 2 node pacemaker cluster running corosync-3.0.1 with dual communication ring
> > sometimes reports errors like this in the corosync log file:
> >
> > [KNET ] pmtud: PMTUD link change for host: 2 link: 0 from 470 to 1366
> > [KNET ] pmtud: PMTUD link change for host: 2 link: 1 from 470 to 1366
> > [KNET ] pmtud: Global data MTU changed to: 1366
> > [CFG ] Modified entry 'totem.netmtu' in corosync.conf cannot be changed at run-time
> > [CFG ] Modified entry 'totem.netmtu' in corosync.conf cannot be changed at run-time
> >
> > Those do not happen very frequently, once a week or so...
> >
>
> Those messages are caused by a config file reload (corosync-cfgtool -R)
> being triggered by something. If they're happening once a week then check
> your cron jobs.
No cron job is at work here, but the messages may well be my own doing,
following a reload, as you suggest.
> > However the system log on the nodes reports those much more frequently, a few
> > times a day:
> >
> > Nov 17 23:26:20 node1 corosync[2258]: [KNET ] link: host: 2 link: 1 is down
> > Nov 17 23:26:20 node1 corosync[2258]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 0)
> > Nov 17 23:26:26 node1 corosync[2258]: [KNET ] rx: host: 2 link: 1 is up
> > Nov 17 23:26:26 node1 corosync[2258]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
> >
>
> Those don't look good. Having a link down for 6 seconds looks like a serious
> network outage that needs looking into, especially if they are that
> frequent, or it could be a bug. You don't say which version of libknet you
> have installed but make sure it's the latest one.
libknet1 is 1.8-2, which is the latest version in the Debian buster distribution.
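
To quantify how long a link stays down, the "is down" / "is up" pairs in the
syslog excerpt above can be matched up and timed. The following is only a
minimal sketch: the function name link_outages and the year argument are my
own assumptions (syslog timestamps carry no year), and the regular expression
is fitted to the four log lines quoted above, not to every knet message.

```python
from datetime import datetime
import re

# Fitted to the syslog lines quoted in this thread, e.g.
# "Nov 17 23:26:20 node1 corosync[2258]: [KNET ] link: host: 2 link: 1 is down"
LINK_EVENT = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+corosync\[\d+\]:\s+"
    r"\[KNET\s*\]\s+(?:link|rx):\s+host:\s+(?P<host>\d+)\s+"
    r"link:\s+(?P<link>\d+)\s+is\s+(?P<state>up|down)"
)


def link_outages(lines, year=2019):
    """Return (host, link, down_time, seconds_down) for each down/up pair.

    The year is an assumption supplied by the caller, since the syslog
    timestamp format shown above does not include one.
    """
    down_since = {}  # (host, link) -> time of the pending "is down" event
    outages = []
    for line in lines:
        m = LINK_EVENT.match(line)
        if not m:
            continue  # e.g. the "best link" lines carry no up/down state
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        key = (int(m["host"]), int(m["link"]))
        if m["state"] == "down":
            down_since[key] = ts
        elif key in down_since:
            start = down_since.pop(key)
            outages.append((*key, start, (ts - start).total_seconds()))
    return outages
```

Fed the four lines quoted above, this reports a single outage on host 2,
link 1. Running it over a full day of syslog would show whether the outages
cluster around particular times.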
> The fencing event in your other message was caused because both links were
> down at the same time, which is a worrying coincidence. Changing the token
> timeout won't make any difference to the knet link events, but if the knet
> links are down for long enough then that will trigger a token timeout and a
> fence event.
>
> Definitely look for something odd in your networking - the corosync.conf
> file looks sane (though having knet_transport in the top-level totem stanza
> is doing nothing), so it's not that.
>
> It's hard to make a judgement with just that info, but look for dropped
> packets on the interfaces, slow response to other network services or very
> high load on one of the nodes. If you can't see anything on the systems then
> enable debug logging and get back to us. If it is a bug we want it fixed!
Since that network outage, no errors have crept into the corosync logs (I have
enabled debug logging). I suspect, as you mention, a hardware issue at the NIC
level, or in the cabling. I do notice quite a few dropped packets on one of the
links...
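
For anyone wanting to watch the same counters, here is a rough sketch that
reads the per-interface drop and error counters Linux exposes under
/sys/class/net/<iface>/statistics/. The function name dropped_counters and
the interface names in the usage note are hypothetical; substitute the
interfaces that carry your two knet links.

```python
from pathlib import Path

# Counter files the Linux kernel exposes for every network interface.
COUNTERS = ("rx_dropped", "tx_dropped", "rx_errors", "tx_errors")


def dropped_counters(ifaces, sysfs_root="/sys/class/net"):
    """Return {iface: {counter_name: value}} for the given interfaces.

    sysfs_root is a parameter only so the function can be pointed at a
    different directory; on a real system the default should be used.
    """
    result = {}
    for iface in ifaces:
        stats_dir = Path(sysfs_root) / iface / "statistics"
        result[iface] = {
            name: int((stats_dir / name).read_text().strip())
            for name in COUNTERS
        }
    return result
```

Calling dropped_counters(["eth0", "eth1"]) twice, some minutes apart, and
comparing the two results shows whether the drops are still accumulating on
one link or are left over from the earlier outage.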
Thanks for the reply,
jf
>
> Chrissie
>
>
> > Are those to be dismissed, or are they indicative of a network
> > misconfiguration or problem? I tried setting 'knet_transport: udpu' in the
> > totem section (the default value) but it didn't seem to make a difference.
> > Hard-coding netmtu to 1500 and allowing for a longer (10s) token timeout
> > also didn't seem to affect the issue.
> >
> >
> > Corosync config follows:
> >
> > /etc/corosync/corosync.conf
> >
> > totem {
> >     version: 2
> >     cluster_name: bicha
> >     transport: knet
> >     link_mode: passive
> >     ip_version: ipv4
> >     token: 10000
> >     netmtu: 1500
> >     knet_transport: sctp
> >     crypto_model: openssl
> >     crypto_hash: sha256
> >     crypto_cipher: aes256
> >     keyfile: /etc/corosync/authkey
> >     interface {
> >         linknumber: 0
> >         knet_transport: udp
> >         knet_link_priority: 0
> >     }
> >     interface {
> >         linknumber: 1
> >         knet_transport: udp
> >         knet_link_priority: 1
> >     }
> > }
> > quorum {
> >     provider: corosync_votequorum
> >     two_node: 1
> >     # expected_votes: 2
> > }
> > nodelist {
> >     node {
> >         ring0_addr: xxx.xxx.xxx.xxx
> >         ring1_addr: zzz.zzz.zzz.zzx
> >         name: node1
> >         nodeid: 1
> >     }
> >     node {
> >         ring0_addr: xxx.xxx.xxx.xxy
> >         ring1_addr: zzz.zzz.zzz.zzy
> >         name: node2
> >         nodeid: 2
> >     }
> > }
> > logging {
> >     to_logfile: yes
> >     to_syslog: yes
> >     logfile: /var/log/corosync/corosync.log
> >     syslog_facility: daemon
> >     debug: off
> >     timestamp: on
> >     logger_subsys {
> >         subsys: QUORUM
> >         debug: off
> >     }
> > }
> > _______________________________________________
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
> >
>