[ClusterLabs] corosync caused network breakdown

Jan Friesse jfriesse at redhat.com
Mon Apr 8 13:08:32 EDT 2019


Sven,

> Hi,
> we have been running a corosync config with two rings for about 2.5 years on a two-node NFS cluster (active/passive). The first ring (ring 0) is configured on a dedicated NIC for cluster-internal communication. The second ring (ring 1) was configured on the interface where the NFS service is running. I thought corosync would primarily use ring 0 and use ring 1 only as a fallback. But I was totally

Corosync 2.x in RRP mode always uses all healthy rings. The difference 
between active and passive is that active sends every message via both 
rings, while passive sends packets across them in a round-robin fashion.
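
For reference, a quick way to see how corosync views its rings is 
corosync-cfgtool -s on one of the nodes. With both rings healthy you 
should see something along these lines (node ID and addresses below are 
only illustrative, based on your two networks):

# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
	id	= 10.7.0.101
	status	= ring 0 active with no faults
RING ID 1
	id	= 10.7.2.101
	status	= ring 1 active with no faults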

> wrong. Since this cluster is in production, we've seen these messages all the time:
> 
> [TOTEM ] Marking ringid 1 interface 10.7.2.101 FAULTY
> [TOTEM ] Automatically recovered ring 1
> 
> But everything seemed to be OK, until last Friday. The first symptom was a dying MySQL replication caused by massive network load. Some time later our web servers started failing: they ran into socket timeouts because they could not reach their NFS shares (on the cluster named above). We encountered massive packet loss, between 10% and 90%, within our LAN, especially on the NFS cluster nodes. We searched for several causes: network loops, dying switches and some other possibilities. But in the end we changed the corosync config to use only one ring (ring 0) and restarted the whole NFS cluster. It took some minutes, but the network issues were gone. No packet loss on the NFS cluster anymore.
> 
> So my guess is that the multicast traffic of corosync killed our switches, so they began to drop

If a user-space application (corosync) really can kill a switch (HW), 
then I would consider throwing away such a switch.

> network packets because of the overload caused by corosync. Is an issue like that already known? Does anyone have a clue what was going on?

I think it's the ifdown problem. If you shut down all the nodes and 
start them again with both rings, does everything work?
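
If a ring gets marked FAULTY again after the restart, the redundant 
ring state can be reset cluster wide from any node, e.g.:

# re-enable a ring that was marked FAULTY (corosync 2.x RRP)
corosync-cfgtool -r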

Anyway, RRP is broken; that's why it was completely replaced by knet in 
3.x. So if you want to stay with 2.x (even though on Leap 42.3 you may 
consider updating to 3.x), I would recommend staying with a single ring 
and using bonding/teaming if possible; a stripped-down totem section for 
that is sketched below.
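
A minimal sketch of a single-ring totem section, assuming ring 0 keeps 
your existing 10.7.0.0 network (or, if you set up a bond, the network 
the bond lives on); the rest of your corosync.conf can stay as it is:

totem {
	version: 2
	cluster_name: nfs-cluster

	crypto_cipher: aes256
	crypto_hash: sha1

	clear_node_high_bit: yes

	# single ring: rrp_mode can be set to none or dropped entirely
	rrp_mode: none

	interface {
		ringnumber: 0
		# network of the dedicated NIC (or of the bond, if used)
		bindnetaddr: 10.7.0.0
		mcastaddr: 239.255.54.33
		mcastport: 5417
		ttl: 2
	}
}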

Honza

> 
> Here are the system specs/configs
> OS: openSUSE Leap 42.3
> Kernel: 4.4.126-48-default
> 
> Installed Cluster packages:
> rpm -qa | grep -Ei "(corosync|pace|cluster)"
> libpacemaker3-1.1.16-3.6.x86_64
> pacemaker-cli-1.1.16-3.6.x86_64
> pacemaker-cts-1.1.16-3.6.x86_64
> libcorosync4-2.3.6-7.1.x86_64
> pacemaker-1.1.16-3.6.x86_64
> pacemaker-remote-1.1.16-3.6.x86_64
> 
> Used Corosync Config at the time of the incident:
> totem {
> 	version: 2
> 	cluster_name: nfs-cluster
> 
> 	crypto_cipher: aes256
> 	crypto_hash: sha1
> 
> 	clear_node_high_bit: yes
> 
> 
> 	rrp_mode: passive
> 
> 	interface {
> 		ringnumber: 0
> 		bindnetaddr: 10.7.0.0
> 		mcastaddr: 239.255.54.33
> 		mcastport: 5417
> 		ttl: 2
> 	}
> 	interface {
> 		ringnumber: 1
> 		bindnetaddr: 10.7.2.0
> 		mcastaddr: 239.255.54.35
> 		mcastport: 5417
> 		ttl: 2
> 	}
> }
> 
> logging {
> 	fileline: off
> 	to_stderr: no
> 	to_syslog: yes
> 	debug: off
> 	timestamp: on
> 	logger_subsys {
> 		subsys: QUORUM
> 		debug: off
> 	}
> }
> 
> quorum {
> 	provider: corosync_votequorum
> 	expected_votes: 2
> 	two_node: 1
> 	wait_for_all: 0
> }
> 
> 
> kind regards
> Sven
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
> 


