[ClusterLabs] corosync caused network breakdown
Sven Möller
smoeller at nichthelfer.de
Mon Apr 8 10:11:09 EDT 2019
Hi,
we have been running a corosync config with two rings for about 2.5 years on a two-node NFS cluster (active/passive). The first ring (ring 0) is configured on a dedicated NIC for cluster-internal communication. The second ring (ring 1) was configured on the interface that the NFS service runs on. I thought corosync would primarily use ring 0 and only fall back to ring 1, but I was totally wrong. Since this cluster went into production, we have seen these messages all the time:
[TOTEM ] Marking ringid 1 interface 10.7.2.101 FAULTY
[TOTEM ] Automatically recovered ring 1
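(For reference, the ring status corosync currently sees can be checked on each node with cfgtool, e.g.:

corosync-cfgtool -s

It lists both ring IDs and whether they are active or have been marked FAULTY.)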
But everything seemed to be OK until last Friday. The first symptom was a dying MySQL replication caused by massive network load. Some time later our web servers started failing: they ran into socket timeouts because they could not reach their NFS shares (on the cluster named above). We encountered massive packet loss, between 10% and 90%, within our LAN, especially towards the NFS cluster nodes. We searched for several possible causes: network loops, dying switches and a few other things. In the end we changed the corosync config to use only one ring (ring 0) and restarted the whole NFS cluster. It took some minutes, but then the network issues were gone. No packet loss on the NFS cluster anymore.
So my guess is that the multicast traffic of corosync overloaded our switches, and they began to drop network packets because of that load. Is an issue like that already known? Does anyone have a clue what was going on?
Here are the system specs/configs:
OS: openSUSE Leap 42.3
Kernel: 4.4.126-48-default
Installed Cluster packages:
rpm -qa | grep -Ei "(corosync|pace|cluster)"
libpacemaker3-1.1.16-3.6.x86_64
pacemaker-cli-1.1.16-3.6.x86_64
pacemaker-cts-1.1.16-3.6.x86_64
libcorosync4-2.3.6-7.1.x86_64
pacemaker-1.1.16-3.6.x86_64
pacemaker-remote-1.1.16-3.6.x86_64
Corosync config used at the time of the incident:
totem {
    version: 2
    cluster_name: nfs-cluster
    crypto_cipher: aes256
    crypto_hash: sha1
    clear_node_high_bit: yes
    rrp_mode: passive
    interface {
        ringnumber: 0
        bindnetaddr: 10.7.0.0
        mcastaddr: 239.255.54.33
        mcastport: 5417
        ttl: 2
    }
    interface {
        ringnumber: 1
        bindnetaddr: 10.7.2.0
        mcastaddr: 239.255.54.35
        mcastport: 5417
        ttl: 2
    }
}
logging {
    fileline: off
    to_stderr: no
    to_syslog: yes
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}
quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
    wait_for_all: 0
}
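The single-ring config we switched to as a workaround is essentially the totem section above with the ring 1 interface block (and the rrp_mode line) dropped; roughly this sketch, not the exact file:

totem {
    version: 2
    cluster_name: nfs-cluster
    crypto_cipher: aes256
    crypto_hash: sha1
    clear_node_high_bit: yes
    interface {
        ringnumber: 0
        bindnetaddr: 10.7.0.0
        mcastaddr: 239.255.54.33
        mcastport: 5417
        ttl: 2
    }
}

The logging and quorum sections stayed unchanged.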
kind regards
Sven