[ClusterLabs] Investigation of Corosync Heartbeat Loss: Simulating Network Failures with Redundant Network Configuration

Tue May 27 07:24:36 UTC 2025

On the server side, the service1 and service2 addresses of every node belong to the same subnet:
lustre-mds-node32
    service1: 10.255.153.236
    service2: 10.255.153.237
lustre-oss-node32
    service1: 10.255.153.238
    service2: 10.255.153.239
lustre-mds-node40
    service1: 10.255.153.240
    service2: 10.255.153.241
lustre-oss-node40
    service1: 10.255.153.242
    service2: 10.255.153.243
lustre-mds-node41
    service1: 10.255.153.244
    service2: 10.255.153.245
lustre-oss-node41
    service1: 10.255.153.246
    service2: 10.255.153.247
Root Cause
Messages sent to a node's service2 address are not answered from service2. With both interfaces in the same subnet, the replies leave through service1 instead, so communication over service2 fails.
Solution
The solution is to configure policy-based routing on the server side, as described for the ARP flux issue on multi-rail (MR) nodes in the LNet Router Config Guide: https://wiki.lustre.org/LNet_Router_Config_Guide
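
For anyone hitting the same problem, here is a minimal dry-run sketch of what such policy-based routing could look like for lustre-mds-node32. It only prints the commands. The /24 prefix length, the table numbers 101 and 102, and the helper name print_pbr are assumptions for illustration and are not taken from my actual configuration.

```shell
#!/bin/sh
# Dry-run sketch: prints the routing commands instead of executing them.
# Assumptions (not from the original post): a /24 prefix and table ids 101/102.

SUBNET="10.255.153.0/24"   # assumed prefix length for the shared subnet

# print_pbr IFACE ADDR TABLE  (hypothetical helper)
print_pbr() {
    iface=$1; addr=$2; table=$3
    # Per-interface table: reach the shared subnet through this interface only.
    echo "ip route add $SUBNET dev $iface src $addr table $table"
    # Packets sourced from this address are looked up in that table,
    # so a reply leaves through the interface that owns the address.
    echo "ip rule add from $addr table $table"
}

# Addresses of lustre-mds-node32 from the list above.
print_pbr service1 10.255.153.236 101
print_pbr service2 10.255.153.237 102
```

After reviewing the output, the printed lines can be run as root. `ip rule show` and `ip route show table 102` should then list the new entries.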

chenzufei at gmail.com

From: users-request
Date: 2025-03-14 17:48
To: users
Subject: Users Digest, Vol 122, Issue 3
----------------------------------------------------------------------

Message: 1
Date: Fri, 14 Mar 2025 17:48:22 +0800
From: "chenzufei at gmail.com" <chenzufei at gmail.com>
To: users <users at clusterlabs.org>
Subject: [ClusterLabs] Investigation of Corosync Heartbeat Loss:
Simulating Network Failures with Redundant Network Configuration
Message-ID: <2025031417480017156612 at gmail.com>
Content-Type: text/plain; charset="gb2312"

Background:
There are 11 physical machines, with two virtual machines running on each physical machine.
lustre-mds-nodexx runs the Lustre MDS server, and lustre-oss-nodexx runs the Lustre OSS service.
Each virtual machine is directly connected to two network interfaces, service1 and service2.
Pacemaker is used to ensure high availability of the Lustre services.
Software versions: Lustre 2.15.5, Corosync 3.1.5, Pacemaker 2.1.0-8.el8, pcs 0.10.8.

Issue: During testing, the service1 network interface on lustre-oss-node30 and lustre-oss-node40 was repeatedly brought down and up at 1-second intervals to simulate a network failure.
The Corosync logs showed that heartbeats were lost, triggering a fencing action that powered off the nodes with lost heartbeats.
Given that Corosync is configured with redundant networks, why did the heartbeat loss occur? Is it due to a configuration issue, or is Corosync not designed to handle this scenario?
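
To make the test procedure concrete, here is a hypothetical sketch of an interface-flapping loop of the kind described above. It is not the attached ip_up_and_down.sh. The function name flap and the RUN switch are assumptions for illustration, and by default the commands are only printed.

```shell
#!/bin/sh
# Hypothetical sketch; NOT the attached ip_up_and_down.sh.
# RUN defaults to "echo", so the commands are only printed.
# Export RUN as an empty string (as root) to actually toggle the link.
RUN=${RUN-echo}

# flap IFACE CYCLES INTERVAL_SECONDS  (hypothetical helper)
flap() {
    iface=$1; cycles=$2; interval=$3
    i=0
    while [ "$i" -lt "$cycles" ]; do
        $RUN ip link set dev "$iface" down
        sleep "$interval"
        $RUN ip link set dev "$iface" up
        sleep "$interval"
        i=$((i + 1))
    done
}

# One down/up cycle on service1 with the 1-second interval used in the test.
flap service1 1 1
```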

Other information:
The Corosync configuration is in the attached file corosync.conf.
The relevant logs are in the attached file log.txt.
The script used for the up/down test is attached as ip_up_and_down.sh.

chenzufei at gmail.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: log.txt
Type: application/octet-stream
Size: 25107 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20250314/e36f13fe/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ip_up_and_down.sh
Type: application/octet-stream
Size: 209 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20250314/e36f13fe/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: corosync.conf
Type: application/octet-stream
Size: 1863 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20250314/e36f13fe/attachment-0002.obj>
