[ClusterLabs] Investigation of Corosync Heartbeat Loss: Simulating Network Failures with Redundant Network Configuration

chenzufei at gmail.com
Fri Mar 14 09:48:22 UTC 2025


Background: 
There are 11 physical machines, with two virtual machines running on each physical machine.
lustre-mds-nodexx runs the Lustre MDS server, and lustre-oss-nodexx runs the Lustre OSS service.
Each virtual machine is directly connected to two network interfaces, service1 and service2.
Pacemaker is used to ensure high availability of the Lustre services.
Software versions: Lustre 2.15.5, Corosync 3.1.5, Pacemaker 2.1.0-8.el8, pcs 0.10.8

Issue: During testing, the service1 network interface on lustre-oss-node30 and lustre-oss-node40 was repeatedly brought down and back up at 1-second intervals to simulate a network failure.
The Corosync logs showed lost heartbeats, which triggered a fencing action that powered off the nodes whose heartbeats were lost.
Given that Corosync is configured with redundant networks, why did the heartbeat loss occur? Is it due to a configuration issue, or is Corosync not designed to handle this scenario?

Other:
The configuration of corosync.conf can be found in the attached file corosync.conf.
Other relevant information is available in the attached file log.txt.
The script used for the up/down testing is attached as ip_up_and_down.sh.
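
Because the attachment is not reproduced inline, here is a minimal sketch of what such an up/down test could look like. It is not the attached ip_up_and_down.sh. The use of "ip link set", the variable names (IFACE, INTERVAL, CYCLES, DRY_RUN), and the choice to wait one interval after each state change are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of an interface flapping test; NOT the attached script.
set -eu

IFACE="${IFACE:-service1}"   # interface to flap (name taken from the post)
INTERVAL="${INTERVAL:-1}"    # seconds to wait after each state change (assumed)
CYCLES="${CYCLES:-0}"        # number of down/up cycles; 0 means do nothing
DRY_RUN="${DRY_RUN:-1}"      # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it (needs root).
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Bring the interface down and back up CYCLES times.
flap() {
    i=0
    while [ "$i" -lt "$CYCLES" ]; do
        run ip link set dev "$IFACE" down
        sleep "$INTERVAL"
        run ip link set dev "$IFACE" up
        sleep "$INTERVAL"
        i=$((i + 1))
    done
}

flap
```

With the defaults the sketch changes nothing. If saved as flap.sh (a hypothetical file name), something like "CYCLES=60 DRY_RUN=0 sh flap.sh", run as root on the node under test, would flap service1 for about two minutes.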

-------------- next part --------------
A non-text attachment was scrubbed...
Name: log.txt
Type: application/octet-stream
Size: 25107 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20250314/e36f13fe/attachment-0003.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ip_up_and_down.sh
Type: application/octet-stream
Size: 209 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20250314/e36f13fe/attachment-0004.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: corosync.conf
Type: application/octet-stream
Size: 1863 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20250314/e36f13fe/attachment-0005.obj>