[ClusterLabs] Incorrect Node Fencing Issue in Lustre Cluster During Network Failure Simulation

chenzufei at gmail.com
Tue Jun 10 03:54:04 UTC 2025


Background:
There are four physical machines, each hosting two virtual machines: the lustre-mds-nodexx VMs run the Lustre MDS service and the lustre-oss-nodexx VMs run the Lustre OSS service. Each virtual machine has two directly attached network interfaces, service1 (ens6f0np0) and service2 (ens6f1np1). Pacemaker is used to provide high availability for the Lustre services.
Software versions:
Lustre: 2.15.5
Corosync: 3.1.5
Pacemaker: 2.1.0-8.el8
PCS: 0.10.8


Operation:
During testing, the network interfaces service1 and service2 on lustre-oss-node40 and lustre-mds-node40 were repeatedly brought down and up (down for 20 seconds, then up for 30 seconds, for 10 iterations) to simulate a network failure:

for i in {1..10}; do date; ifconfig ens6f0np0 down && ifconfig ens6f1np1 down; sleep 20; date; ifconfig ens6f0np0 up && ifconfig ens6f1np1 up; date; sleep 30; done
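
While the interfaces are flapping, the knet link state and cluster membership as seen by the surviving nodes can be watched with standard tools (illustrative commands, not part of the original test):

    # local ring/link status reported by corosync on the node where this runs
    corosync-cfgtool -s

    # cluster membership and resource state as seen by Pacemaker
    pcs status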


Issue:
In theory, lustre-oss-node40 and lustre-mds-node40 (the nodes whose interfaces were taken down) should have been fenced, but lustre-mds-node32 was fenced instead.


Related Logs:
Jun 09 17:54:51 node32 fence_virtd[2502]: Destroying domain 60e80c07-107e-4e8a-ba42-39e48b3e6bb7   // This log indicates that lustre-mds-node32 was fenced.


* turning off of lustre-mds-node32 successful: delegate=lustre-mds-node42, client=pacemaker-controld.8918, origin=lustre-mds-node42, completed='2025-06-09 17:54:54.527116 +08:00'
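
To trace which node requested and executed the fence action, the fencing history can be inspected on a surviving node (illustrative commands, assuming the pcs 0.10 / Pacemaker 2.1 syntax):

    # fencing history for the fenced node, as recorded by the fencer
    pcs stonith history show lustre-mds-node32

    # equivalent lower-level view
    stonith_admin --history lustre-mds-node32 --verbose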



Jun 09 17:54:10 [1429] lustre-mds-node32 corosync info    [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Jun 09 17:54:10 [1429] lustre-mds-node32 corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 09 17:54:10 [1429] lustre-mds-node32 corosync info    [KNET  ] pmtud: Global data MTU changed to: 1397
Jun 09 17:54:31 [1429] lustre-mds-node32 corosync info    [KNET  ] link: host: 1 link: 0 is down
Jun 09 17:54:31 [1429] lustre-mds-node32 corosync info    [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1429] lustre-mds-node32 corosync info    [KNET  ] link: host: 1 link: 1 is down
Jun 09 17:54:34 [1429] lustre-mds-node32 corosync info    [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1429] lustre-mds-node32 corosync warning [KNET  ] host: host: 1 has no active links
Jun 09 17:54:36 [1429] lustre-mds-node32 corosync notice  [TOTEM ] Token has not been received in 8475 ms
Jun 09 17:57:44 [1419] lustre-mds-node32 corosync notice  [MAIN  ] Corosync Cluster Engine 3.1.8 starting up


Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info    [KNET  ] link: host: 4 link: 0 is down
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info    [KNET  ] link: host: 3 link: 0 is down
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info    [KNET  ] link: host: 2 link: 0 is down
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info    [KNET  ] host: host: 4 (passive) best link: 1 (pri: 1)
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info    [KNET  ] host: host: 3 (passive) best link: 1 (pri: 1)
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info    [KNET  ] host: host: 2 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync notice  [TOTEM ] Token has not been received in 8475 ms
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info    [KNET  ] link: host: 4 link: 1 is down
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info    [KNET  ] link: host: 3 link: 1 is down
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info    [KNET  ] link: host: 2 link: 1 is down
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info    [KNET  ] host: host: 4 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync warning [KNET  ] host: host: 4 has no active links
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info    [KNET  ] host: host: 3 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync warning [KNET  ] host: host: 3 has no active links
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info    [KNET  ] host: host: 2 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync warning [KNET  ] host: host: 2 has no active links
Jun 09 17:54:37 [1412] lustre-mds-node40 corosync notice  [TOTEM ] A processor failed, forming new configuration: token timed out (11300ms), waiting 13560ms for consensus.
Jun 09 17:54:46 [1412] lustre-mds-node40 corosync info    [KNET  ] link: Resetting MTU for link 1 because host 3 joined
Jun 09 17:54:46 [1412] lustre-mds-node40 corosync info    [KNET  ] host: host: 3 (passive) best link: 1 (pri: 1)
Jun 09 17:54:46 [1412] lustre-mds-node40 corosync info    [KNET  ] pmtud: Global data MTU changed to: 1397
Jun 09 17:54:47 [1412] lustre-mds-node40 corosync info    [KNET  ] link: Resetting MTU for link 1 because host 2 joined
Jun 09 17:54:47 [1412] lustre-mds-node40 corosync info    [KNET  ] host: host: 2 (passive) best link: 1 (pri: 1)
Jun 09 17:54:47 [1412] lustre-mds-node40 corosync info    [KNET  ] pmtud: Global data MTU changed to: 1397
Jun 09 17:54:50 [1412] lustre-mds-node40 corosync notice  [QUORUM] Sync members[3]: 1 2 3
Jun 09 17:54:50 [1412] lustre-mds-node40 corosync notice  [QUORUM] Sync left[1]: 4
Jun 09 17:54:50 [1412] lustre-mds-node40 corosync notice  [TOTEM ] A new membership (1.45) was formed. Members left: 4
Jun 09 17:54:50 [1412] lustre-mds-node40 corosync notice  [TOTEM ] Failed to receive the leave message. failed: 4


Jun 09 17:54:29 [8913] lustre-mds-node41 corosync info    [KNET  ] link: host: 1 link: 0 is down
Jun 09 17:54:29 [8913] lustre-mds-node41 corosync info    [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:30 [8913] lustre-mds-node41 corosync info    [KNET  ] link: host: 1 link: 1 is down
Jun 09 17:54:30 [8913] lustre-mds-node41 corosync info    [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:30 [8913] lustre-mds-node41 corosync warning [KNET  ] host: host: 1 has no active links
Jun 09 17:54:36 [8913] lustre-mds-node41 corosync notice  [TOTEM ] Token has not been received in 8475 ms
Jun 09 17:54:39 [8913] lustre-mds-node41 corosync notice  [TOTEM ] A processor failed, forming new configuration: token timed out (11300ms), waiting 13560ms for consensus.
Jun 09 17:54:47 [8913] lustre-mds-node41 corosync info    [KNET  ] rx: host: 1 link: 1 is up
Jun 09 17:54:47 [8913] lustre-mds-node41 corosync info    [KNET  ] link: Resetting MTU for link 1 because host 1 joined
Jun 09 17:54:47 [8913] lustre-mds-node41 corosync info    [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:47 [8913] lustre-mds-node41 corosync info    [KNET  ] pmtud: Global data MTU changed to: 1397
Jun 09 17:54:50 [8913] lustre-mds-node41 corosync notice  [QUORUM] Sync members[3]: 1 2 3
Jun 09 17:54:50 [8913] lustre-mds-node41 corosync notice  [QUORUM] Sync left[1]: 4
Jun 09 17:54:50 [8913] lustre-mds-node41 corosync notice  [TOTEM ] A new membership (1.45) was formed. Members left: 4
Jun 09 17:54:50 [8913] lustre-mds-node41 corosync notice  [TOTEM ] Failed to receive the leave message. failed: 4


Jun 09 17:54:28 [8900] lustre-mds-node42 corosync info    [KNET  ] link: host: 1 link: 0 is down
Jun 09 17:54:28 [8900] lustre-mds-node42 corosync info    [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:30 [8900] lustre-mds-node42 corosync info    [KNET  ] link: host: 1 link: 1 is down
Jun 09 17:54:30 [8900] lustre-mds-node42 corosync info    [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:30 [8900] lustre-mds-node42 corosync warning [KNET  ] host: host: 1 has no active links
Jun 09 17:54:36 [8900] lustre-mds-node42 corosync notice  [TOTEM ] Token has not been received in 8475 ms
Jun 09 17:54:45 [8900] lustre-mds-node42 corosync info    [KNET  ] rx: host: 1 link: 1 is up
Jun 09 17:54:45 [8900] lustre-mds-node42 corosync info    [KNET  ] link: Resetting MTU for link 1 because host 1 joined
Jun 09 17:54:45 [8900] lustre-mds-node42 corosync info    [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:45 [8900] lustre-mds-node42 corosync info    [KNET  ] pmtud: Global data MTU changed to: 1397
Jun 09 17:54:50 [8900] lustre-mds-node42 corosync notice  [QUORUM] Sync members[3]: 1 2 3
Jun 09 17:54:50 [8900] lustre-mds-node42 corosync notice  [QUORUM] Sync left[1]: 4
Jun 09 17:54:50 [8900] lustre-mds-node42 corosync notice  [TOTEM ] A new membership (1.45) was formed. Members left: 4
Jun 09 17:54:50 [8900] lustre-mds-node42 corosync notice  [TOTEM ] Failed to receive the leave message. failed: 4




/etc/corosync/corosync.conf
totem {
    version: 2
    cluster_name: mds_cluster
    transport: knet
    crypto_cipher: aes256
    crypto_hash: sha256
    cluster_uuid: 11f2c4097ac44d5981769a9ed579c99e
    token: 10000
}

nodelist {
    node {
        ring0_addr: 10.255.153.240
        ring1_addr: 10.255.153.241
        name: lustre-mds-node40
        nodeid: 1
    }

    node {
        ring0_addr: 10.255.153.244
        ring1_addr: 10.255.153.245
        name: lustre-mds-node41
        nodeid: 2
    }

    node {
        ring0_addr: 10.255.153.248
        ring1_addr: 10.255.153.249
        name: lustre-mds-node42
        nodeid: 3
    }

    node {
        ring0_addr: 10.255.153.236
        ring1_addr: 10.255.153.237
        name: lustre-mds-node32
        nodeid: 4
    }
}

quorum {
    provider: corosync_votequorum
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    timestamp: on
}
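
For reference, the timeout values in the logs appear consistent with this configuration, assuming the corosync.conf(5) defaults for the values not set here (token_coefficient = 650 ms, consensus = 1.2 * token timeout, and the "Token has not been received" warning being logged at 75% of the token timeout):

    effective token timeout = token + (nodes - 2) * token_coefficient
                            = 10000 + (4 - 2) * 650 = 11300 ms   ("token timed out (11300ms)")
    token warning           = 0.75 * 11300          = 8475 ms    ("Token has not been received in 8475 ms")
    consensus               = 1.2 * 11300           = 13560 ms   ("waiting 13560ms for consensus")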






chenzufei at gmail.com