[ClusterLabs] heads up: Possible VM data corruption upgrading to SLES15 SP3

Thu Jan 27 09:10:20 EST 2022

Hi!

I know this is semi-offtopic, but I think it's important:
I upgraded one cluster node, a Xen host, from SLES15 SP2 to SLES15 SP3 by booting from a virtual DVD (i.e. the upgrade environment is loaded from that DVD).
Watching the syslog while Yast was searching for systems to upgrade, I noticed that it tried to mount _every_ disk read-only.
(We use multipathed FC SAN disks that are attached as block devices to the VMs, so they look like "normal" disks)

On my first attempt I did not enable multipath, as it is not needed to upgrade the OS (the system VG is single-pathed), but then LVM complained about multiple disks having the same ID.
On the second attempt I did activate multipathing, but then YaST mounted every disk and tried to assemble every MDRAID it found, even those on shared storage that were actively in use by the other cluster nodes.

To make things worse, even when mounting read-only, XFS (for example) tries to "recover" a filesystem that it thinks is dirty.
I found no way to avoid that mounting (a support case at SUSE is in progress).
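
For anyone who wants to check which filesystems are affected, here is a minimal Python sketch. It assumes the input follows the /proc/mounts layout (device, mount point, type, options, ...) and that the options that skip journal replay ("norecovery" for XFS, "noload" or "norecovery" for ext3/ext4) show up in the options field. The function name and the table are made up for illustration; this is not something YaST provides.

```python
# Sketch: flag read-only mounts that may still have replayed a journal.
# Assumption: each line follows the /proc/mounts layout:
#   device mountpoint fstype options dump pass

# Mount options that skip log/journal replay, per filesystem type.
# (Illustrative table; other filesystems are simply ignored.)
NO_REPLAY_OPTIONS = {
    "xfs": {"norecovery"},
    "ext3": {"noload", "norecovery"},
    "ext4": {"noload", "norecovery"},
}


def risky_readonly_mounts(mount_table):
    """Return (device, mountpoint, fstype) for every read-only mount of a
    journaling filesystem that lacks a no-replay option."""
    risky = []
    for line in mount_table.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype = fields[0], fields[1], fields[2]
        options = set(fields[3].split(","))
        safe_options = NO_REPLAY_OPTIONS.get(fstype)
        if safe_options is None:
            continue  # filesystem type not covered by this sketch
        if "ro" in options and not (options & safe_options):
            risky.append((device, mountpoint, fstype))
    return risky
```

Feeding it the contents of /proc/mounts, e.g. risky_readonly_mounts(open("/proc/mounts").read()), lists the read-only mounts where a log replay could have happened. It only reports them; it cannot prevent the mount.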

Fortunately, if the VMs have been running for a significant time, most blocks are cached inside the VM, and blocks are mostly written rather than read. So the badly recovered blocks are most likely overwritten with good data before the machine reboots and the bad blocks would be read.

The most obvious "solution", stopping every VM on the whole cluster before upgrading a single node, is unfortunately not very HA-like.

Any better ideas anyone?

Regards,
Ulrich