[ClusterLabs] Investigation of Corosync Heartbeat Loss: Simulating Network Failures with Redundant Network Configuration

Andrei Borzenkov arvidjaar at gmail.com
Fri Mar 14 10:43:56 UTC 2025


On Fri, Mar 14, 2025 at 12:48 PM chenzufei at gmail.com
<chenzufei at gmail.com> wrote:
>
>
> Background:
> There are 11 physical machines, with two virtual machines running on each physical machine.
> lustre-mds-nodexx runs the Lustre MDS server, and lustre-oss-nodexx runs the Lustre OSS service.
> Each virtual machine is directly connected to two network interfaces, service1 and service2.
> Pacemaker is used to ensure high availability of the Lustre services.
> lustre(2.15.5) + corosync(3.1.5) + pacemaker(2.1.0-8.el8) + pcs(0.10.8)
>
> Issue: During testing, the network interface service1 on lustre-oss-node30 and lustre-oss-node40 was brought up and down repeatedly, once per second, to simulate a network failure.
> The Corosync logs showed that heartbeats were lost, triggering a fencing action that powered off the nodes with lost heartbeats.
> Given that Corosync is configured with redundant networks, why did the heartbeat loss occur? Is it due to a configuration issue, or is Corosync not designed to handle this scenario?

I cannot answer this question, but the common advice on this list has
been *not* to test by bringing an interface down, but to block
communication instead, e.g. with netfilter (iptables/nftables).
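
Something along these lines, for example (a rough sketch only, assuming
the link to be failed is the interface named service1 and the default
netfilter tables; adjust names to match your environment):

  # Simulate a dead link without changing the interface state:
  # drop everything arriving on or leaving via service1.
  iptables -A INPUT  -i service1 -j DROP
  iptables -A OUTPUT -o service1 -j DROP

  # ... observe corosync/pacemaker behaviour, then restore:
  iptables -D INPUT  -i service1 -j DROP
  iptables -D OUTPUT -o service1 -j DROP

The nftables equivalent is a pair of input/output chains dropping on
iifname/oifname "service1". This way the interface and its address stay
up, so knet sees a link that simply stops passing traffic, which is
closer to a real network failure than an address disappearing.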

>
> Other:
> The configuration of corosync.conf can be found in the attached file corosync.conf.
> Other relevant information is available in the attached file log.txt.
> The script used for the up/down testing is attached as ip_up_and_down.sh.
>
>
>
> ________________________________
> chenzufei at gmail.com

