[ClusterLabs] Live migration problem

Digimer lists at alteeve.ca
Wed Oct 5 13:02:09 EDT 2016

Hi all,

  I just spent a fair bit of time debugging a weird error, and now that
I've solved it, I wanted to share it on the list so that it is archived.
With luck, it will save someone else some heartache. No replies are
expected. :)

* Anvil m2 (RHEL 6.8, cman+rgmanager+kvm+drbd+clvmd, fully updated)
* Guest VM OS - Win2012 R2 64-bit

  When I tried to live-migrate the server, rgmanager failed with:

[root@an-a07n02 ~]# clusvcadm -M Windows-Server-2012-R2 -m an-a07n02.alteeve.ca
Trying to migrate service:Windows-Server-2012-R2 to
an-a07n02.alteeve.ca...Failed; service running on original owner

/var/log/messages showed:
Oct  4 19:15:05 an-a07n01 rgmanager[4213]: Migrating
vm:Windows-Server-2012-R2 to an-a07n02.alteeve.ca
Oct  4 19:15:41 an-a07n01 rgmanager[7588]: [vm] Migrate
Windows-Server-2012-R2 to an-a07n02.alteeve.ca failed:
Oct  4 19:15:41 an-a07n01 rgmanager[7610]: [vm] error: Unable to read
from monitor: Connection reset by peer
Oct  4 19:15:41 an-a07n01 rgmanager[4213]: migrate on vm
"Windows-Server-2012-R2" returned 150 (unspecified)
Oct  4 19:15:41 an-a07n01 rgmanager[4213]: Migration of
vm:Windows-Server-2012-R2 to an-a07n02.alteeve.ca failed; return code 150

I disabled the VM in rgmanager, booted it manually with virsh, and tried
to live-migrate it directly. Note that the server booted fine on node 2,
and I was trying to migrate from node 2 -> node 1. Note also that
'--unsafe' is required because nodes with 4 KiB-sector disks can't use
'cache="none"' in KVM/qemu, so we set 'cache="writethrough"' instead,
which is still safe.
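For reference, the cache mode is set on each disk's <driver> element in
the libvirt domain XML. A minimal sketch (the LV path and target device
here are illustrative, not taken from this cluster):

```xml
<disk type='block' device='disk'>
  <!-- writethrough: writes are only acknowledged once they reach the
       storage layer, so it is safe where cache='none' is refused on
       4 KiB-sector disks -->
  <driver name='qemu' type='raw' cache='writethrough'/>
  <source dev='/dev/vg0/example-lv'/>  <!-- hypothetical LV path -->
  <target dev='vda' bus='virtio'/>
</disk>
```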

[root@an-a07n02 ~]# virsh migrate --live Windows-Server-2012-R2 \
qemu+ssh://an-a07n01.alteeve.ca/system --unsafe
error: Unable to read from monitor: Connection reset by peer

In the qemu log file:

2016-10-05 16:11:19.948+0000: starting up
LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin QEMU_AUDIO_DRV=spice
/usr/libexec/qemu-kvm -name Windows-Server-2012-R2 -S -M rhel6.6.0 -cpu
-enable-kvm -m 16384 -realtime mlock=off -smp
4,sockets=4,cores=1,threads=1 -uuid be69b994-0f70-ccf3-2934-43eb4a4b795b
-nodefconfig -nodefaults -chardev
-mon chardev=charmonitor,id=monitor,mode=control -rtc
base=localtime,driftfix=slew -no-reboot -no-shutdown -device
ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x4.0x7 -device
-device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive
-device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0
-netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=27 -device
-chardev pty,id=charserial0 -device
isa-serial,chardev=charserial0,id=serial0 -chardev
spicevmc,id=charchannel0,name=vdagent -device
-device usb-tablet,id=input0 -spice
port=5900,addr=,disable-ticketing,seamless-migration=on -vga
qxl -global qxl-vga.ram_size=67108864 -global qxl-vga.vram_size=67108864
-incoming tcp:[::]:49152 -device
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -msg timestamp=on
char device redirected to /dev/pts/0
Features 0x20000250 unsupported. Allowed features: 0x71000454
qemu: warning: error while loading state for instance 0x0 of device
'0000:00:06.0/virtio-blk'
load of migration failed
2016-10-05 16:11:31.503+0000: shutting down

The key here was "qemu: warning: error while loading state for instance
0x0 of device '0000:00:06.0/virtio-blk'".

There was precious little matching this error on Google. I could see no
problems with the XML definition or with the backing LVs (two on this
VM; the LVs are passed up raw to the guest).

Inside the guest OS, I could see no problems. I could, as mentioned
above, boot the server on both nodes, but I could not live migrate.
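For the record, the sanity checks described above amounted to roughly
the following (run on each node; the service name is as used in this
post):

```
# Dump the running definition and eyeball the disk stanzas
virsh dumpxml Windows-Server-2012-R2

# Confirm the backing LVs are visible and active on both nodes
lvs

# Confirm the guest boots and shuts down cleanly on either node
virsh start Windows-Server-2012-R2
virsh shutdown Windows-Server-2012-R2
```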

I got to the point where I started throwing things at the wall out of
desperation. One of those was updating the virtio-blk drivers in the
guest. The guest was built with the 0.1.102 stable virtio drivers, and
the latest stable is now 0.1.126. So I updated the drivers in Device
Manager and, voila, migration started working.

We have many Win2012 R2 guests in production, many of them using the
.102 drivers. So I have a feeling it wasn't so much the upgrade that
made the difference as the reinstallation of the drivers.

I have no idea why this bug happened, but hopefully this might save
someone some grief in the future if they hit the same.

Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
