[ClusterLabs] corosync 3.0.1 on Debian/Buster reports some MTU errors

Wed Nov 20 15:35:04 EST 2019

No one is willing to take a shot at this?

I had a fencing event related to this yesterday morning:

Nov 19 08:04:01 node2 corosync[14399]:   [KNET  ] link: host: 1 link: 0 is down
Nov 19 08:04:01 node2 corosync[14399]:   [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
...
Nov 19 08:05:04 node2 corosync[14399]:   [KNET  ] link: host: 1 link: 1 is down
Nov 19 08:05:04 node2 corosync[14399]:   [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
Nov 19 08:05:04 node2 corosync[14399]:   [KNET  ] host: host: 1 has no active links

There are 2 links so I'm a bit baffled why the 2nd one didn't do the job...
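
To see how the two links overlapped, here is a rough sketch that replays the
KNET "is up"/"is down" messages and tracks which links are down per host. It is
not a corosync tool; the regular expression and the names LINK_EVENT and
link_timeline are my own assumptions, based only on the log format shown in this
thread. If link 0 never came back in the lines I elided, then by 08:05:04 both
links were down, which would explain "has no active links".

```python
import re
from collections import defaultdict

# Assumed pattern, derived only from the syslog lines quoted in this thread:
#   "... [KNET  ] link: host: 1 link: 0 is down"
#   "... [KNET  ] rx: host: 2 link: 1 is up"
LINK_EVENT = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]+)\s+(?P<node>\S+)\s+corosync\[\d+\]:\s+"
    r"\[KNET\s*\]\s+(?:link|rx): host: (?P<host>\d+) "
    r"link: (?P<link>\d+) is (?P<state>up|down)"
)


def link_timeline(lines):
    """Yield (timestamp, host, link, state, links_currently_down) per event."""
    down = defaultdict(set)  # host id -> set of link numbers currently down
    for line in lines:
        m = LINK_EVENT.search(line)
        if not m:
            continue  # e.g. the "best link" lines are ignored
        host, link = int(m["host"]), int(m["link"])
        if m["state"] == "down":
            down[host].add(link)
        else:
            down[host].discard(link)
        yield m["ts"], host, link, m["state"], sorted(down[host])


# The link events from the excerpt above (elided lines are not included).
sample = [
    "Nov 19 08:04:01 node2 corosync[14399]:   [KNET  ] link: host: 1 link: 0 is down",
    "Nov 19 08:04:01 node2 corosync[14399]:   [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)",
    "Nov 19 08:05:04 node2 corosync[14399]:   [KNET  ] link: host: 1 link: 1 is down",
]

for ts, host, link, state, links_down in link_timeline(sample):
    print(f"{ts} host {host} link {link} {state}; links down now: {links_down}")
```

Feeding it the full syslog instead of the sample should show whether link 0
recovered before link 1 dropped.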

thanks,
jf

* Jean-Francois Malouin <Jean-Francois.Malouin at bic.mni.mcgill.ca> [20191118 16:31]:
> Hi,
> 
> Maybe not directly a pacemaker question but maybe some of you have seen this
> problem:
> 
> A 2 node pacemaker cluster running corosync-3.0.1 with dual communication ring
> sometimes reports errors like this in the corosync log file:
> 
> [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 470 to 1366
> [KNET  ] pmtud: PMTUD link change for host: 2 link: 1 from 470 to 1366
> [KNET  ] pmtud: Global data MTU changed to: 1366
> [CFG   ] Modified entry 'totem.netmtu' in corosync.conf cannot be changed at run-time
> [CFG   ] Modified entry 'totem.netmtu' in corosync.conf cannot be changed at run-time
> 
> Those do not happen very frequently, once a week or so...
> 
> However the system log on the nodes reports those much more frequently, a few
> times a day:
> 
> Nov 17 23:26:20 node1 corosync[2258]:   [KNET  ] link: host: 2 link: 1 is down
> Nov 17 23:26:20 node1 corosync[2258]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 0)
> Nov 17 23:26:26 node1 corosync[2258]:   [KNET  ] rx: host: 2 link: 1 is up
> Nov 17 23:26:26 node1 corosync[2258]:   [KNET  ] host: host: 2 (passive) best link: 1 (pri: 1)
> 
> Are those to be dismissed or are they indicative of a network misconfig/problem?
> I tried setting 'knet_transport: udpu' in the totem section (the default value)
> but it didn't seem to make a difference... Hard-coding netmtu to 1500 and
> allowing for a longer (10s) token timeout also didn't seem to affect the issue.
> 
> 
> Corosync config follows:
> 
> /etc/corosync/corosync.conf
> 
> totem {
>     version: 2
>     cluster_name: bicha
>     transport: knet
>     link_mode: passive
>     ip_version: ipv4
>     token: 10000
>     netmtu: 1500
>     knet_transport: sctp
>     crypto_model: openssl
>     crypto_hash: sha256
>     crypto_cipher: aes256
>     keyfile: /etc/corosync/authkey
>     interface {
>         linknumber: 0
>         knet_transport: udp
>         knet_link_priority: 0
>     }
>     interface {
>         linknumber: 1
>         knet_transport: udp
>         knet_link_priority: 1
>     }
> }
> quorum {
>     provider: corosync_votequorum
>     two_node: 1
> #    expected_votes: 2
> }
> nodelist {
>     node {
>         ring0_addr: xxx.xxx.xxx.xxx
>         ring1_addr: zzz.zzz.zzz.zzx
>         name: node1
>         nodeid: 1
>     } 
>     node {
>         ring0_addr: xxx.xxx.xxx.xxy
>         ring1_addr: zzz.zzz.zzz.zzy
>         name: node2
>         nodeid: 2
>     } 
> }
> logging {
>     to_logfile: yes
>     to_syslog: yes
>     logfile: /var/log/corosync/corosync.log
>     syslog_facility: daemon
>     debug: off
>     timestamp: on
>     logger_subsys {
>         subsys: QUORUM
>         debug: off
>     }
> }