[ClusterLabs] Incorrect Node Fencing Issue in Lustre Cluster During Network Failure Simulation
chenzufei at gmail.com
Tue Jun 10 03:54:04 UTC 2025
Background:
The cluster consists of 4 physical machines, each hosting two virtual machines: lustre-mds-nodexx runs the Lustre MDS service and lustre-oss-nodexx runs the Lustre OSS service. Each virtual machine is directly attached to two network interfaces, service1 (ens6f0np0) and service2 (ens6f1np1). Pacemaker is used to provide high availability for the Lustre services.
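For reference, the state of both knet links can be checked on each node with the standard corosync 3 tools; a minimal sketch:

# Local node id plus link/ring status towards every peer
corosync-cfgtool -s
# Current membership as corosync sees it
corosync-cmapctl | grep members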
Software versions:
Lustre: 2.15.5
Corosync: 3.1.5
Pacemaker: 2.1.0-8.el8
PCS: 0.10.8
Operation:
During testing, the network interfaces service1 and service2 on lustre-oss-node40 and lustre-mds-node40 were repeatedly taken down for 20 seconds and then brought back up for 30 seconds, over 10 iterations, to simulate a network failure:
for i in {1..10}; do date; ifconfig ens6f0np0 down && ifconfig ens6f1np1 down; sleep 20; date; ifconfig ens6f0np0 up && ifconfig ens6f1np1 up; date; sleep 30; done
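An equivalent loop using the iproute2 ip commands (ifconfig is deprecated on EL8) would look like the sketch below; the interface names are assumed to match the setup above:

for i in {1..10}; do date; ip link set ens6f0np0 down && ip link set ens6f1np1 down; sleep 20; date; ip link set ens6f0np0 up && ip link set ens6f1np1 up; date; sleep 30; done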
Issue:
In theory, lustre-oss-node40 and lustre-mds-node40 (the nodes whose links were taken down) should have been fenced, but lustre-mds-node32 was fenced instead.
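The fencing events can be cross-checked from any surviving node; a minimal sketch, assuming the standard Pacemaker/pcs tooling from the versions listed above:

# Fencing history recorded by the fencer, for all nodes
stonith_admin --history '*'
# The same information via pcs (pcs 0.10 'stonith history' subcommand)
pcs stonith history show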
Related Logs:
Jun 09 17:54:51 node32 fence_virtd[2502]: Destroying domain 60e80c07-107e-4e8a-ba42-39e48b3e6bb7 // This log indicates that lustre-mds-node32 was fenced.
* turning off of lustre-mds-node32 successful: delegate=lustre-mds-node42, client=pacemaker-controld.8918, origin=lustre-mds-node42, completed='2025-06-09 17:54:54.527116 +08:00'
Jun 09 17:54:10 [1429] lustre-mds-node32 corosync info [KNET ] link: Resetting MTU for link 0 because host 1 joined
Jun 09 17:54:10 [1429] lustre-mds-node32 corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 09 17:54:10 [1429] lustre-mds-node32 corosync info [KNET ] pmtud: Global data MTU changed to: 1397
Jun 09 17:54:31 [1429] lustre-mds-node32 corosync info [KNET ] link: host: 1 link: 0 is down
Jun 09 17:54:31 [1429] lustre-mds-node32 corosync info [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1429] lustre-mds-node32 corosync info [KNET ] link: host: 1 link: 1 is down
Jun 09 17:54:34 [1429] lustre-mds-node32 corosync info [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1429] lustre-mds-node32 corosync warning [KNET ] host: host: 1 has no active links
Jun 09 17:54:36 [1429] lustre-mds-node32 corosync notice [TOTEM ] Token has not been received in 8475 ms
Jun 09 17:57:44 [1419] lustre-mds-node32 corosync notice [MAIN ] Corosync Cluster Engine 3.1.8 starting up
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info [KNET ] link: host: 4 link: 0 is down
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info [KNET ] link: host: 3 link: 0 is down
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info [KNET ] link: host: 2 link: 0 is down
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info [KNET ] host: host: 4 (passive) best link: 1 (pri: 1)
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync notice [TOTEM ] Token has not been received in 8475 ms
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info [KNET ] link: host: 4 link: 1 is down
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info [KNET ] link: host: 3 link: 1 is down
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info [KNET ] link: host: 2 link: 1 is down
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info [KNET ] host: host: 4 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync warning [KNET ] host: host: 4 has no active links
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync warning [KNET ] host: host: 3 has no active links
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync warning [KNET ] host: host: 2 has no active links
Jun 09 17:54:37 [1412] lustre-mds-node40 corosync notice [TOTEM ] A processor failed, forming new configuration: token timed out (11300ms), waiting 13560ms for consensus.
Jun 09 17:54:46 [1412] lustre-mds-node40 corosync info [KNET ] link: Resetting MTU for link 1 because host 3 joined
Jun 09 17:54:46 [1412] lustre-mds-node40 corosync info [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
Jun 09 17:54:46 [1412] lustre-mds-node40 corosync info [KNET ] pmtud: Global data MTU changed to: 1397
Jun 09 17:54:47 [1412] lustre-mds-node40 corosync info [KNET ] link: Resetting MTU for link 1 because host 2 joined
Jun 09 17:54:47 [1412] lustre-mds-node40 corosync info [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
Jun 09 17:54:47 [1412] lustre-mds-node40 corosync info [KNET ] pmtud: Global data MTU changed to: 1397
Jun 09 17:54:50 [1412] lustre-mds-node40 corosync notice [QUORUM] Sync members[3]: 1 2 3
Jun 09 17:54:50 [1412] lustre-mds-node40 corosync notice [QUORUM] Sync left[1]: 4
Jun 09 17:54:50 [1412] lustre-mds-node40 corosync notice [TOTEM ] A new membership (1.45) was formed. Members left: 4
Jun 09 17:54:50 [1412] lustre-mds-node40 corosync notice [TOTEM ] Failed to receive the leave message. failed: 4
Jun 09 17:54:29 [8913] lustre-mds-node41 corosync info [KNET ] link: host: 1 link: 0 is down
Jun 09 17:54:29 [8913] lustre-mds-node41 corosync info [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:30 [8913] lustre-mds-node41 corosync info [KNET ] link: host: 1 link: 1 is down
Jun 09 17:54:30 [8913] lustre-mds-node41 corosync info [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:30 [8913] lustre-mds-node41 corosync warning [KNET ] host: host: 1 has no active links
Jun 09 17:54:36 [8913] lustre-mds-node41 corosync notice [TOTEM ] Token has not been received in 8475 ms
Jun 09 17:54:39 [8913] lustre-mds-node41 corosync notice [TOTEM ] A processor failed, forming new configuration: token timed out (11300ms), waiting 13560ms for consensus.
Jun 09 17:54:47 [8913] lustre-mds-node41 corosync info [KNET ] rx: host: 1 link: 1 is up
Jun 09 17:54:47 [8913] lustre-mds-node41 corosync info [KNET ] link: Resetting MTU for link 1 because host 1 joined
Jun 09 17:54:47 [8913] lustre-mds-node41 corosync info [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:47 [8913] lustre-mds-node41 corosync info [KNET ] pmtud: Global data MTU changed to: 1397
Jun 09 17:54:50 [8913] lustre-mds-node41 corosync notice [QUORUM] Sync members[3]: 1 2 3
Jun 09 17:54:50 [8913] lustre-mds-node41 corosync notice [QUORUM] Sync left[1]: 4
Jun 09 17:54:50 [8913] lustre-mds-node41 corosync notice [TOTEM ] A new membership (1.45) was formed. Members left: 4
Jun 09 17:54:50 [8913] lustre-mds-node41 corosync notice [TOTEM ] Failed to receive the leave message. failed: 4
Jun 09 17:54:28 [8900] lustre-mds-node42 corosync info [KNET ] link: host: 1 link: 0 is down
Jun 09 17:54:28 [8900] lustre-mds-node42 corosync info [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:30 [8900] lustre-mds-node42 corosync info [KNET ] link: host: 1 link: 1 is down
Jun 09 17:54:30 [8900] lustre-mds-node42 corosync info [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:30 [8900] lustre-mds-node42 corosync warning [KNET ] host: host: 1 has no active links
Jun 09 17:54:36 [8900] lustre-mds-node42 corosync notice [TOTEM ] Token has not been received in 8475 ms
Jun 09 17:54:45 [8900] lustre-mds-node42 corosync info [KNET ] rx: host: 1 link: 1 is up
Jun 09 17:54:45 [8900] lustre-mds-node42 corosync info [KNET ] link: Resetting MTU for link 1 because host 1 joined
Jun 09 17:54:45 [8900] lustre-mds-node42 corosync info [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:45 [8900] lustre-mds-node42 corosync info [KNET ] pmtud: Global data MTU changed to: 1397
Jun 09 17:54:50 [8900] lustre-mds-node42 corosync notice [QUORUM] Sync members[3]: 1 2 3
Jun 09 17:54:50 [8900] lustre-mds-node42 corosync notice [QUORUM] Sync left[1]: 4
Jun 09 17:54:50 [8900] lustre-mds-node42 corosync notice [TOTEM ] A new membership (1.45) was formed. Members left: 4
Jun 09 17:54:50 [8900] lustre-mds-node42 corosync notice [TOTEM ] Failed to receive the leave message. failed: 4
/etc/corosync/corosync.conf
totem {
    version: 2
    cluster_name: mds_cluster
    transport: knet
    crypto_cipher: aes256
    crypto_hash: sha256
    cluster_uuid: 11f2c4097ac44d5981769a9ed579c99e
    token: 10000
}

nodelist {
    node {
        ring0_addr: 10.255.153.240
        ring1_addr: 10.255.153.241
        name: lustre-mds-node40
        nodeid: 1
    }
    node {
        ring0_addr: 10.255.153.244
        ring1_addr: 10.255.153.245
        name: lustre-mds-node41
        nodeid: 2
    }
    node {
        ring0_addr: 10.255.153.248
        ring1_addr: 10.255.153.249
        name: lustre-mds-node42
        nodeid: 3
    }
    node {
        ring0_addr: 10.255.153.236
        ring1_addr: 10.255.153.237
        name: lustre-mds-node32
        nodeid: 4
    }
}

quorum {
    provider: corosync_votequorum
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    timestamp: on
}
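For what it's worth, the timeouts seen in the logs are consistent with this configuration and corosync's documented defaults (token_coefficient = 650 ms, consensus = 1.2 * token):

runtime token timeout = token + (nodes - 2) * token_coefficient = 10000 + (4 - 2) * 650 = 11300 ms
consensus timeout = 1.2 * 11300 = 13560 ms

which matches the "token timed out (11300ms), waiting 13560ms for consensus" messages above.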
chenzufei at gmail.com