From chenzufei at gmail.com  Mon Mar 3 03:52:24 2025
From: chenzufei at gmail.com (chenzufei at gmail.com)
Date: Mon, 3 Mar 2025 11:52:24 +0800
Subject: [ClusterLabs] Lustre MDT/OST Mount Failures During Virtual Machine Reboot with Pacemaker
Message-ID: <2025030311522366665011@gmail.com>

1. Background:
There are three physical servers, each running a KVM virtual machine. The virtual machines host Lustre services (MGS/MDS/OSS). Pacemaker is used to ensure high availability of the Lustre services.
lustre(2.15.5) + corosync(3.1.5) + pacemaker(2.1.0-8.el8) + pcs(0.10.8)

2. Problem:
When a reboot command is issued on one of the virtual machines, the MDT/OST resources are taken over by the virtual machines on other nodes. However, the mounting of these resources fails during the switch (Pacemaker attempts to mount multiple times and eventually succeeds).
Workaround: Before executing the reboot command, run pcs node standby to move the resources away.
Question: I would like to know if this is an inherent issue with Pacemaker?

3. Analysis:
From the log analysis, it appears that the MDT/OST resources are being mounted on the target node before the unmount process is completed on the source node. The Multiple Mount Protection (MMP) detects that the source node has updated the sequence number, which causes the mount operation to fail on the target node.

4. Logs:

Node 28 (Source Node):

Tue Feb 18 23:46:31 CST 2025 reboot

ll /dev/disk/by-id/virtio-ost-node28-3-36
lrwxrwxrwx 1 root root 9 Feb 18 23:47 /dev/disk/by-id/virtio-ost-node28-3-36 -> ../../vdy

Tue Feb 18 23:46:31 CST 2025
* ost-36_start_0 on lustre-oss-node29 'error' (1): call=769, status='complete', exitreason='Couldn't mount device [/dev/disk/by-id/virtio-ost-node28-3-36] as /lustre/ost-36', last-rc-change='Tue Feb 18 23:46:32 2025', queued=0ms, exec=21472ms

Feb 18 23:46:31 lustre-oss-node28 systemd[1]: Unmounting /lustre/ost-36...
Feb 18 23:46:31 lustre-oss-node28 kernel: LDISKFS-fs warning (device vdy): kmmpd:186: czf MMP failure info: epoch:6609375025013, seq: 37, last update time: 1739893591, last update node: lustre-oss-node28, last update device: vdy
Feb 18 23:46:32 lustre-oss-node28 Filesystem(ost-36)[19748]: INFO: Running stop for /dev/disk/by-id/virtio-ost-node28-3-36 on /lustre/ost-36
Feb 18 23:46:32 lustre-oss-node28 pacemaker-controld[1700]: notice: Result of stop operation for ost-36 on lustre-oss-node28: ok
Feb 18 23:46:34 lustre-oss-node28 kernel: LDISKFS-fs warning (device vdy): kmmpd:258: czf set mmp seq clean
Feb 18 23:46:34 lustre-oss-node28 kernel: LDISKFS-fs warning (device vdy): kmmpd:258: czf MMP failure info: epoch:6612033802827, seq: 4283256144, last update time: 1739893594, last update node: lustre-oss-node28, last update device: vdy
Feb 18 23:46:34 lustre-oss-node28 systemd[1]: Unmounted /lustre/ost-36.

Node 29 (Target Node):

/dev/disk/by-id/virtio-ost-node28-3-36 -> ../../vdt

Feb 18 23:46:32 lustre-oss-node29 Filesystem(ost-36)[451114]: INFO: Running start for /dev/disk/by-id/virtio-ost-node28-3-36 on /lustre/ost-36
Feb 18 23:46:32 lustre-oss-node29 kernel: LDISKFS-fs warning (device vdt): ldiskfs_multi_mount_protect:350: MMP interval 42 higher than expected, please wait.
Feb 18 23:46:53 lustre-oss-node29 kernel: czf, not equel, Current time: 23974372799987 ns, 37,4283256144
Feb 18 23:46:53 lustre-oss-node29 kernel: LDISKFS-fs warning (device vdt): ldiskfs_multi_mount_protect:364: czf MMP failure info: epoch:23974372801877, seq: 4283256144, last update time: 1739893594, last update node: lustre-oss-node28, last update device: vdy

chenzufei at gmail.com

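A minimal sketch of the standby workaround described above, using the node name that appears in the logs; the node name and the idea of draining before reboot come from the post, everything else is an assumption:

```
# drain the node so Pacemaker relocates the Lustre resources cleanly
# before the reboot, instead of racing the system shutdown
pcs node standby lustre-oss-node28
reboot
# once the node is back up and has rejoined the cluster:
pcs node unstandby lustre-oss-node28
```
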
From oalbrigt at redhat.com  Mon Mar 3 09:03:06 2025
From: oalbrigt at redhat.com (Oyvind Albrigtsen)
Date: Mon, 3 Mar 2025 10:03:06 +0100
Subject: [ClusterLabs] Lustre MDT/OST Mount Failures During Virtual Machine Reboot with Pacemaker
In-Reply-To: <2025030311522366665011@gmail.com>
References: <2025030311522366665011@gmail.com>
Message-ID:

You need the systemd drop-in functionality introduced in RHEL 9.3 to avoid this issue:
https://bugzilla.redhat.com/show_bug.cgi?id=2184779

Oyvind

On 03/03/25 11:52 +0800, chenzufei at gmail.com wrote:
>1. Background:
>There are three physical servers, each running a KVM virtual machine. The virtual machines host Lustre services (MGS/MDS/OSS). Pacemaker is used to ensure high availability of the Lustre services.
>lustre(2.15.5) + corosync(3.1.5) + pacemaker(2.1.0-8.el8) + pcs(0.10.8)
>2. Problem:
>When a reboot command is issued on one of the virtual machines, the MDT/OST resources are taken over by the virtual machines on other nodes. However, the mounting of these resources fails during the switch (Pacemaker attempts to mount multiple times and eventually succeeds).
>Workaround: Before executing the reboot command, run pcs node standby to move the resources away.
>Question: I would like to know if this is an inherent issue with Pacemaker?
>3. Analysis:
>From the log analysis, it appears that the MDT/OST resources are being mounted on the target node before the unmount process is completed on the source node. The Multiple Mount Protection (MMP) detects that the source node has updated the sequence number, which causes the mount operation to fail on the target node.
>4. Logs:
>Node 28 (Source Node):
>Tue Feb 18 23:46:31 CST 2025 reboot
>
>ll /dev/disk/by-id/virtio-ost-node28-3-36
>lrwxrwxrwx 1 root root 9 Feb 18 23:47 /dev/disk/by-id/virtio-ost-node28-3-36 -> ../../vdy
>
>Tue Feb 18 23:46:31 CST 2025
>* ost-36_start_0 on lustre-oss-node29 'error' (1): call=769, status='complete', exitreason='Couldn't mount device [/dev/disk/by-id/virtio-ost-node28-3-36] as /lustre/ost-36', last-rc-change='Tue Feb 18 23:46:32 2025', queued=0ms, exec=21472ms
>
>Feb 18 23:46:31 lustre-oss-node28 systemd[1]: Unmounting /lustre/ost-36...
>Feb 18 23:46:31 lustre-oss-node28 kernel: LDISKFS-fs warning (device vdy): kmmpd:186: czf MMP failure info: epoch:6609375025013, seq: 37, last update time: 1739893591, last update node: lustre-oss-node28, last update device: vdy
>Feb 18 23:46:32 lustre-oss-node28 Filesystem(ost-36)[19748]: INFO: Running stop for /dev/disk/by-id/virtio-ost-node28-3-36 on /lustre/ost-36
>Feb 18 23:46:32 lustre-oss-node28 pacemaker-controld[1700]: notice: Result of stop operation for ost-36 on lustre-oss-node28: ok
>Feb 18 23:46:34 lustre-oss-node28 kernel: LDISKFS-fs warning (device vdy): kmmpd:258: czf set mmp seq clean
>Feb 18 23:46:34 lustre-oss-node28 kernel: LDISKFS-fs warning (device vdy): kmmpd:258: czf MMP failure info: epoch:6612033802827, seq: 4283256144, last update time: 1739893594, last update node: lustre-oss-node28, last update device: vdy
>Feb 18 23:46:34 lustre-oss-node28 systemd[1]: Unmounted /lustre/ost-36.
>
>Node 29 (Target Node):
>/dev/disk/by-id/virtio-ost-node28-3-36 -> ../../vdt
>
>Feb 18 23:46:32 lustre-oss-node29 Filesystem(ost-36)[451114]: INFO: Running start for /dev/disk/by-id/virtio-ost-node28-3-36 on /lustre/ost-36
>Feb 18 23:46:32 lustre-oss-node29 kernel: LDISKFS-fs warning (device vdt): ldiskfs_multi_mount_protect:350: MMP interval 42 higher than expected, please wait.
>Feb 18 23:46:53 lustre-oss-node29 kernel: czf, not equel, Current time: 23974372799987 ns, 37,4283256144
>Feb 18 23:46:53 lustre-oss-node29 kernel: LDISKFS-fs warning (device vdt): ldiskfs_multi_mount_protect:364: czf MMP failure info: epoch:23974372801877, seq: 4283256144, last update time: 1739893594, last update node: lustre-oss-node28, last update device: vdy
>
>chenzufei at gmail.com

From alma21 at gmx.at  Mon Mar 3 15:55:25 2025
From: alma21 at gmx.at (A M)
Date: Mon, 3 Mar 2025 16:55:25 +0100 (GMT+01:00)
Subject: [ClusterLabs] multiple nfs server resource groups
Message-ID:

Hi,

is it possible to run 2 or more NFS (server) resource groups in parallel/HA in one cluster, each with its own/unique set of NFS server (instance), VIP, VG/LV, filesystem, ...?

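A hypothetical sketch of the layout being asked about: two independent groups, each with its own VIP, VG/LV, filesystem and nfsserver instance. All resource names, addresses, volume groups and paths below are invented for illustration, and whether several nfsserver instances can coexist on one node (for example after a failover places both groups there) still needs to be verified:

```
# group 1: VIP + VG activation + filesystem + NFS server instance
pcs resource create nfs1-vip ocf:heartbeat:IPaddr2 ip=192.0.2.11 cidr_netmask=24 --group nfs1
pcs resource create nfs1-lvm ocf:heartbeat:LVM-activate vgname=vg_nfs1 vg_access_mode=system_id --group nfs1
pcs resource create nfs1-fs ocf:heartbeat:Filesystem device=/dev/vg_nfs1/lv_export directory=/export/nfs1 fstype=xfs --group nfs1
pcs resource create nfs1-server ocf:heartbeat:nfsserver nfs_shared_infodir=/export/nfs1/nfsinfo --group nfs1

# group 2: same pattern with its own VIP, VG and export
pcs resource create nfs2-vip ocf:heartbeat:IPaddr2 ip=192.0.2.12 cidr_netmask=24 --group nfs2
pcs resource create nfs2-lvm ocf:heartbeat:LVM-activate vgname=vg_nfs2 vg_access_mode=system_id --group nfs2
pcs resource create nfs2-fs ocf:heartbeat:Filesystem device=/dev/vg_nfs2/lv_export directory=/export/nfs2 fstype=xfs --group nfs2
pcs resource create nfs2-server ocf:heartbeat:nfsserver nfs_shared_infodir=/export/nfs2/nfsinfo --group nfs2
```
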
From chenzufei at gmail.com  Fri Mar 14 09:48:22 2025
From: chenzufei at gmail.com (chenzufei at gmail.com)
Date: Fri, 14 Mar 2025 17:48:22 +0800
Subject: [ClusterLabs] Investigation of Corosync Heartbeat Loss: Simulating Network Failures with Redundant Network Configuration
Message-ID: <2025031417480017156612@gmail.com>

Background:
There are 11 physical machines, with two virtual machines running on each physical machine.
lustre-mds-nodexx runs the Lustre MDS server, and lustre-oss-nodexx runs the Lustre OSS service.
Each virtual machine is directly connected to two network interfaces, service1 and service2.
Pacemaker is used to ensure high availability of the Lustre services.
lustre(2.15.5) + corosync(3.1.5) + pacemaker(2.1.0-8.el8) + pcs(0.10.8)

Issue: During testing, the network interface service1 on lustre-oss-node30 and lustre-oss-node40 was repeatedly brought up and down every 1 second (to simulate a network failure).
The Corosync logs showed that heartbeats were lost, triggering a fencing action that powered off the nodes with lost heartbeats.
Given that Corosync is configured with redundant networks, why did the heartbeat loss occur? Is it due to a configuration issue, or is Corosync not designed to handle this scenario?

Other?
The configuration of corosync.conf can be found in the attached file corosync.conf.
Other relevant information is available in the attached file log.txt.
The script used for the up/down testing is attached as ip_up_and_down.sh.

chenzufei at gmail.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: log.txt
Type: application/octet-stream
Size: 25107 bytes
Desc: not available
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ip_up_and_down.sh
Type: application/octet-stream
Size: 209 bytes
Desc: not available
-------------- next part --------------
A non-text attachment was scrubbed...
Name: corosync.conf
Type: application/octet-stream
Size: 1863 bytes
Desc: not available

From arvidjaar at gmail.com  Fri Mar 14 10:43:56 2025
From: arvidjaar at gmail.com (Andrei Borzenkov)
Date: Fri, 14 Mar 2025 13:43:56 +0300
Subject: [ClusterLabs] Investigation of Corosync Heartbeat Loss: Simulating Network Failures with Redundant Network Configuration
In-Reply-To: <2025031417480017156612@gmail.com>
References: <2025031417480017156612@gmail.com>
Message-ID:

On Fri, Mar 14, 2025 at 12:48 PM chenzufei at gmail.com wrote:
>
> Background:
> There are 11 physical machines, with two virtual machines running on each physical machine.
> lustre-mds-nodexx runs the Lustre MDS server, and lustre-oss-nodexx runs the Lustre OSS service.
> Each virtual machine is directly connected to two network interfaces, service1 and service2.
> Pacemaker is used to ensure high availability of the Lustre services.
> lustre(2.15.5) + corosync(3.1.5) + pacemaker(2.1.0-8.el8) + pcs(0.10.8)
>
> Issue: During testing, the network interface service1 on lustre-oss-node30 and lustre-oss-node40 was repeatedly brought up and down every 1 second (to simulate a network failure).
> The Corosync logs showed that heartbeats were lost, triggering a fencing action that powered off the nodes with lost heartbeats.
> Given that Corosync is configured with redundant networks, why did the heartbeat loss occur? Is it due to a configuration issue, or is Corosync not designed to handle this scenario?

I cannot answer this question, but the common advice on this list was to *not* test by bringing an interface down but by blocking communication, e.g. using netfilter (iptables/nftables).

> Other?
> The configuration of corosync.conf can be found in the attached file corosync.conf.
> Other relevant information is available in the attached file log.txt.
> The script used for the up/down testing is attached as ip_up_and_down.sh.
>
> chenzufei at gmail.com

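A minimal sketch of the netfilter-based failure injection suggested above, assuming the peer node's ring0_addr is 10.255.153.159 as in the nodelist excerpt quoted later in the thread; the address and the sleep duration are placeholders:

```
#!/bin/bash
# Simulate a failure of one corosync link by dropping traffic to/from the
# peer's ring0 address, instead of taking the interface down.
PEER=10.255.153.159   # placeholder: the peer node's ring0_addr

iptables -I INPUT  -s "$PEER" -j DROP
iptables -I OUTPUT -d "$PEER" -j DROP

sleep 60              # observe whether corosync keeps the membership via the other link

iptables -D INPUT  -s "$PEER" -j DROP
iptables -D OUTPUT -d "$PEER" -j DROP
```
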
From jfriesse at redhat.com  Fri Mar 14 11:02:30 2025
From: jfriesse at redhat.com (Jan Friesse)
Date: Fri, 14 Mar 2025 12:02:30 +0100
Subject: [ClusterLabs] Investigation of Corosync Heartbeat Loss: Simulating Network Failures with Redundant Network Configuration
In-Reply-To: <2025031417480017156612@gmail.com>
References: <2025031417480017156612@gmail.com>
Message-ID: <19e01d82-57d7-e46c-4383-4acbccf2230f@redhat.com>

On 14/03/2025 10:48, chenzufei at gmail.com wrote:
>
> Background:
> There are 11 physical machines, with two virtual machines running on each physical machine.
> lustre-mds-nodexx runs the Lustre MDS server, and lustre-oss-nodexx runs the Lustre OSS service.
> Each virtual machine is directly connected to two network interfaces, service1 and service2.
> Pacemaker is used to ensure high availability of the Lustre services.
> lustre(2.15.5) + corosync(3.1.5) + pacemaker(2.1.0-8.el8) + pcs(0.10.8)
>
> Issue: During testing, the network interface service1 on lustre-oss-node30 and lustre-oss-node40 was repeatedly brought up and down every 1 second (to simulate a network failure).
> The Corosync logs showed that heartbeats were lost, triggering a fencing action that powered off the nodes with lost heartbeats.
> Given that Corosync is configured with redundant networks, why did the heartbeat loss occur?

Honestly, I don't think it is really configured with redundant networks.

> Is it due to a configuration issue, or is Corosync not designed to handle this scenario?

Ifdown is not an ideal method of testing, but Corosync 3.x should be able to handle it. Still, using iptables/nftables/firewall is recommended.

> Other?
> The configuration of corosync.conf can be found in the attached file corosync.conf.

From the config file it looks like both rings are on the same network. Could you please share your network configuration?

Honza

> Other relevant information is available in the attached file log.txt.
> The script used for the up/down testing is attached as ip_up_and_down.sh.
>
> chenzufei at gmail.com

From chenzufei at gmail.com  Sat Mar 15 03:31:30 2025
From: chenzufei at gmail.com (chenzufei at gmail.com)
Date: Sat, 15 Mar 2025 11:31:30 +0800
Subject: [ClusterLabs] Lustre MDT/OST Mount Failures During Virtual Machine Reboot with Pacemaker
References:
Message-ID: <202503151131279039978@gmail.com>

Thank you for your advice.

The reason, as I understand it, is as follows: during reboot, both the system and Pacemaker unmount the Lustre resource at the same time. If the system starts the unmount first and Pacemaker's stop runs afterward, Pacemaker returns success immediately even though the system's unmount is not yet complete, so Pacemaker starts the mount on the target node too early, which triggers this issue.

My current modification is as follows. Add the following lines to the file `/usr/lib/systemd/system/resource-agents-deps.target`:

```
After=remote-fs.target
Before=shutdown.target reboot.target halt.target
```

After making this modification, the issue no longer occurs during reboot.

chenzufei at gmail.com

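The same ordering can also be applied as a systemd drop-in rather than by editing the packaged unit file, so the change survives package updates. A minimal sketch; the drop-in file name is arbitrary, and the directives are exactly the ones quoted above:

```
# create a drop-in for resource-agents-deps.target under /etc instead of
# editing the unit shipped in /usr/lib/systemd/system
mkdir -p /etc/systemd/system/resource-agents-deps.target.d
cat > /etc/systemd/system/resource-agents-deps.target.d/lustre-ordering.conf <<'EOF'
[Unit]
After=remote-fs.target
Before=shutdown.target reboot.target halt.target
EOF
systemctl daemon-reload
```
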
From piled.email at gmail.com  Sun Mar 16 12:20:04 2025
From: piled.email at gmail.com (Piled Email)
Date: Sun, 16 Mar 2025 13:20:04 +0100
Subject: [ClusterLabs] Resource placement on remote nodes with location constraints and transient attributes not working
Message-ID:

I am trying to place a resource on a remote node with a location constraint that uses a transient attribute, and this does not work.

The resource is a custom resource script that starts an interface using networkctl:

primitive router-interface ocf:custom:network_interface params interface=router

The location constraint is:

location router-interface-location router-interface rule router-score: defined router-ping and router-ping gt 0

(router-ping is set by a ping resource)

The nodes:

charybdis: remote router-score=6000
skylla: remote router-score=9000

The transient attributes:

> attrd_updater -A --name router-ping
name="router-ping" host="charybdis" value="100"
name="router-ping" host="skylla" value="100"

crm_simulate output shows that the resource will not be placed anywhere:

crm_simulate -sL -Q
pcmk__primitive_assign: router-interface allocation score on argos: -INFINITY
pcmk__primitive_assign: router-interface allocation score on hestia: -INFINITY
pcmk__primitive_assign: router-interface allocation score on pylon: -INFINITY

If I use only non-transient attributes, the resource will be placed; e.g. the following places the resource:

location router-interface-location router-interface rule router-score: defined router-score

But as soon as the location constraint references transient attributes, it will not work.

Pacemaker version is 2.1.8.

Am I doing something wrong here?

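A small, hypothetical verification sketch using the node and attribute names from the post above, to confirm where the transient attribute is actually recorded for the remote nodes and what scores the scheduler computes:

```
# query the attribute manager directly for one remote node
attrd_updater --query --name router-ping --node charybdis

# check the CIB status section, where transient attributes (including those
# of Pacemaker Remote nodes) are stored
crm_attribute --type status --node charybdis --name router-ping --query

# re-run the placement simulation with scores shown
crm_simulate --live-check --show-scores --quiet | grep router-interface
```
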
Kind regards, Ulrich Windl From: Users On Behalf Of chenzufei at gmail.com Sent: Friday, March 14, 2025 10:48 AM To: users Subject: [EXT] [ClusterLabs] Investigation of Corosync Heartbeat Loss: Simulating Network Failures with Redundant Network Configuration Background: There are 11 physical machines, with two virtual machines running on each physical machine. lustre-mds-nodexx runs the Lustre MDS server, and lustre-oss-nodexx runs the Lustre OSS service. Each virtual machine is directly connected to two network interfaces, service1 and service2. Pacemaker is used to ensure high availability of the Lustre services. lustre(2.15.5) + corosync(3.1.5) + pacemaker(2.1.0-8.el8) + pcs(0.10.8) Issue: During testing, the network interface service1 on lustre-oss-node30 and lustre-oss-node40 was repeatedly brought up and down every 1 second (to simulate a network failure). The Corosync logs showed that heartbeats were lost, triggering a fencing action that powered off the nodes with lost heartbeats. Given that Corosync is configured with redundant networks, why did the heartbeat loss occur? Is it due to a configuration issue, or is Corosync not designed to handle this scenario? Other? The configuration of corosync.conf can be found in the attached file corosync.conf. Other relevant information is available in the attached file log.txt. The script used for the up/down testing is attached as ip_up_and_down.sh. ________________________________ chenzufei at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From tbean74 at gmail.com Mon Mar 24 01:23:50 2025 From: tbean74 at gmail.com (Travis Bean) Date: Sun, 23 Mar 2025 18:23:50 -0700 Subject: [ClusterLabs] Bash automation for Pacemaker, Corosync, DRBD, GFS, CLVM, and LCMC Message-ID: Hello, I developed a Bash script to automate the installation and configuration of open-source software (i.e., launchpad.net/linuxha). I want to make sure the syntax of this script is perfect so I can use it as a teaching tool to educate people about Linux. I need to know if there is anything misconfigured with my LinuxHA high-availability Bash syntax. Writing the Bash code to automate the installation and configuration of Pacemaker, Corosync, DRBD, GFS, CLVM, and LCMC was a painstaking process of trial and error to create this custom setup. I could not find any how-to guide on the internet to show me step-by-step instructions for what I wanted to achieve, so when I first got this custom setup successfully working on two virtual machines, it was really a miracle. I haven?t been able to thoroughly test this Bash script out to make sure it functions as intended for all environments, so I need feedback to make sure I haven?t overlooked something with my current syntax. If you find a bug in LinuxHA, please submit a bug report to bugs.launchpad.net/linuxha/+filebug. Kind regards, Travis Bean