[ClusterLabs] corosync 3.0.1 on Debian/Buster reports some MTU errors

christine caulfield ccaulfie at redhat.com
Thu Nov 21 03:19:22 EST 2019


On 18/11/2019 21:31, Jean-Francois Malouin wrote:
> Hi,
> 
> Maybe not directly a pacemaker question but maybe some of you have seen this
> problem:
> 
> A 2 node pacemaker cluster running corosync-3.0.1 with dual communication ring
> sometimes reports errors like this in the corosync log file:
> 
> [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 470 to 1366
> [KNET  ] pmtud: PMTUD link change for host: 2 link: 1 from 470 to 1366
> [KNET  ] pmtud: Global data MTU changed to: 1366
> [CFG   ] Modified entry 'totem.netmtu' in corosync.conf cannot be changed at run-time
> [CFG   ] Modified entry 'totem.netmtu' in corosync.conf cannot be changed at run-time
> 
> Those do not happen very frequently, once a week or so...
> 

Those messages are caused by a config file reload (corosync-cfgtool -R) 
being triggered by something. If they're happening once a week then 
check your cron jobs.
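If it is something scheduled, a quick grep over the usual cron locations will usually find it. This is only a rough sketch - the paths below are typical Debian locations, not taken from your system:

```shell
# Hunt for whatever might be scheduling a config reload (corosync-cfgtool -R).
# Paths are the usual Debian cron locations (an assumption; adjust for your site).
hits=$(grep -rl 'corosync-cfgtool' /etc/cron* /var/spool/cron 2>/dev/null || true)
if [ -n "$hits" ]; then
    echo "possible reload triggers:"
    echo "$hits"
else
    echo "no scheduled corosync-cfgtool calls found"
fi
```

Also worth checking systemd timers (systemctl list-timers) and any config-management tool (puppet/ansible/etc.) that might rewrite corosync.conf and reload.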

> However the system log on the nodes reports those much more frequently, a few
> times a day:
> 
> Nov 17 23:26:20 node1 corosync[2258]:   [KNET  ] link: host: 2 link: 1 is down
> Nov 17 23:26:20 node1 corosync[2258]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 0)
> Nov 17 23:26:26 node1 corosync[2258]:   [KNET  ] rx: host: 2 link: 1 is up
> Nov 17 23:26:26 node1 corosync[2258]:   [KNET  ] host: host: 2 (passive) best link: 1 (pri: 1)
> 

Those don't look good. Having a link down for 6 seconds looks like a 
serious network outage that needs looking into, especially if they are 
that frequent - or it could be a bug. You don't say which version of 
libknet you have installed, but make sure it's the latest one.
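For what it's worth, you can measure the outages straight from syslog. This is a rough sketch (not a corosync tool - the function name is made up, and the sample input is just the two lines quoted above) that pairs each "is down" event with the next "is up" and prints the gap:

```shell
# Rough sketch: pair each knet "is down" syslog line with the next "is up"
# line and print how long the link was out. Assumes both events fall on the
# same day (no midnight rollover handling).
measure_downtime() {
    awk '
        /\[KNET/ && / is down/ {
            split($3, t, ":"); down = t[1]*3600 + t[2]*60 + t[3]
        }
        /\[KNET/ && / is up/ && down {
            split($3, t, ":")
            print "link down for " ((t[1]*3600 + t[2]*60 + t[3]) - down) " s"
            down = 0
        }
    '
}

# Fed with the two lines quoted above:
measure_downtime <<'EOF'
Nov 17 23:26:20 node1 corosync[2258]:   [KNET  ] link: host: 2 link: 1 is down
Nov 17 23:26:26 node1 corosync[2258]:   [KNET  ] rx: host: 2 link: 1 is up
EOF
# prints "link down for 6 s"
```

Running the whole log through something like that would show whether the outages are always ~6 seconds (which would smell like a timeout somewhere) or vary randomly (which would point at the network).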

The fencing event in your other message happened because both links 
were down at the same time, which is a worrying coincidence. Changing 
the token timeout won't make any difference to the knet link events, but 
if the knet links are down for long enough then that will trigger a 
token timeout and a fence event.

Definitely look for something odd in your networking - the corosync.conf 
file looks sane (though having knet_transport in the top-level totem 
stanza is doing nothing), so it's not that.

It's hard to make a judgement with just that info, but look for dropped 
packets on the interfaces, slow response to other network services or 
very high load on one of the nodes. If you can't see anything on the 
systems, then enable debug logging and get back to us. If it is a bug, 
we want it fixed!
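As a concrete starting point for the dropped-packet check, the kernel's own counters in sysfs are enough - no extra tools needed (standard Linux paths; your interface names will differ):

```shell
# Read per-interface RX drop counters straight from sysfs; these are the
# same numbers "ip -s link" reports. Works on any modern Linux kernel.
stats=$(for dev in /sys/class/net/*; do
    printf '%s rx_dropped=%s\n' "${dev##*/}" \
        "$(cat "$dev/statistics/rx_dropped" 2>/dev/null || echo '?')"
done)
echo "$stats"
```

Sample each counter a few times a day; if the numbers climb around the times of the knet link-down events, you've found your culprit. For debug logging, set 'debug: on' in the logging{} stanza of /etc/corosync/corosync.conf on both nodes and reload with corosync-cfgtool -R.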

Chrissie


> Are those to be dismissed or are they indicative of a network misconfig/problem?
> I tried setting 'knet_transport: udpu' in the totem section (the default value)
> but it didn't seem to make a difference... Hard-coding netmtu to 1500 and
> allowing for a longer (10s) token timeout also didn't seem to affect the issue.
> 
> 
> Corosync config follows:
> 
> /etc/corosync/corosync.conf
> 
> totem {
>      version: 2
>      cluster_name: bicha
>      transport: knet
>      link_mode: passive
>      ip_version: ipv4
>      token: 10000
>      netmtu: 1500
>      knet_transport: sctp
>      crypto_model: openssl
>      crypto_hash: sha256
>      crypto_cipher: aes256
>      keyfile: /etc/corosync/authkey
>      interface {
>          linknumber: 0
>          knet_transport: udp
>          knet_link_priority: 0
>      }
>      interface {
>          linknumber: 1
>          knet_transport: udp
>          knet_link_priority: 1
>      }
> }
> quorum {
>      provider: corosync_votequorum
>      two_node: 1
> #    expected_votes: 2
> }
> nodelist {
>      node {
>          ring0_addr: xxx.xxx.xxx.xxx
>          ring1_addr: zzz.zzz.zzz.zzx
>          name: node1
>          nodeid: 1
>      }
>      node {
>          ring0_addr: xxx.xxx.xxx.xxy
>          ring1_addr: zzz.zzz.zzz.zzy
>          name: node2
>          nodeid: 2
>      }
> }
> logging {
>      to_logfile: yes
>      to_syslog: yes
>      logfile: /var/log/corosync/corosync.log
>      syslog_facility: daemon
>      debug: off
>      timestamp: on
>      logger_subsys {
>          subsys: QUORUM
>          debug: off
>      }
> }
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
> 
