[ClusterLabs] Live migration problem

Digimer lists at alteeve.ca
Wed Oct 5 13:02:09 EDT 2016


Hi all,

  I just spent a fair bit of time debugging a weird error, and now that
I've solved it, I wanted to share it on the list so that it is archived.
With luck, it will save someone else some heartache. No replies are
expected. :)

Environment:
* Anvil m2 (RHEL 6.8, cman+rgmanager+kvm+drbd+clvmd, fully updated)
* Guest VM OS - Win2012 R2 64-bit

  When I tried to live-migrate the server, rgmanager failed with:

[root@an-a07n02 ~]# clusvcadm -M Windows-Server-2012-R2 -m an-a07n02.alteeve.ca
Trying to migrate service:Windows-Server-2012-R2 to an-a07n02.alteeve.ca...Failed; service running on original owner

/var/log/messages showed:
====
Oct  4 19:15:05 an-a07n01 rgmanager[4213]: Migrating vm:Windows-Server-2012-R2 to an-a07n02.alteeve.ca
Oct  4 19:15:41 an-a07n01 rgmanager[7588]: [vm] Migrate Windows-Server-2012-R2 to an-a07n02.alteeve.ca failed:
Oct  4 19:15:41 an-a07n01 rgmanager[7610]: [vm] error: Unable to read from monitor: Connection reset by peer
Oct  4 19:15:41 an-a07n01 rgmanager[4213]: migrate on vm "Windows-Server-2012-R2" returned 150 (unspecified)
Oct  4 19:15:41 an-a07n01 rgmanager[4213]: Migration of vm:Windows-Server-2012-R2 to an-a07n02.alteeve.ca failed; return code 150
====

I disabled the VM in rgmanager, booted it manually using virsh and tried
to live migrate it directly. Note that the server booted fine on node 2,
and I was trying to migrate from 2 -> 1. Note also that '--unsafe' is
required because nodes using 4 KiB sector disks can't use 'cache="none"'
in KVM/qemu, so we set 'cache="writethrough"' instead, which is still safe.

[root@an-a07n02 ~]# virsh migrate --live Windows-Server-2012-R2 qemu+ssh://an-a07n01.alteeve.ca/system --unsafe
error: Unable to read from monitor: Connection reset by peer
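
To see which disks lead to needing '--unsafe', the cache mode of each
'-drive' can be read from the qemu command line that libvirt writes to the
guest's log. The following is only a minimal Python sketch: the function
name drive_cache_modes is hypothetical, the sample is trimmed from this
guest's command line, and it assumes (as the note above implies) that
libvirt objects to disks whose cache mode is not "none".

```python
def drive_cache_modes(cmdline):
    """Map each -drive id to its cache mode (None if no cache= option is set)."""
    tokens = cmdline.split()
    modes = {}
    # Walk adjacent token pairs so each "-drive" is paired with its option string.
    for flag, value in zip(tokens, tokens[1:]):
        if flag != "-drive":
            continue
        opts = dict(opt.split("=", 1) for opt in value.split(",") if "=" in opt)
        # Fall back to the file path as the key if the drive has no id.
        modes[opts.get("id", opts.get("file"))] = opts.get("cache")
    return modes


# Trimmed excerpt of this guest's qemu command line (hypothetical helper input).
cmdline = """
-drive
file=/dev/an-a07n01_vg0/Windows-Server-2012-R2_0,if=none,id=drive-virtio-disk0,format=raw,cache=writethrough,aio=native
-device
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-drive
file=/shared/files/virtio-win-0.1.102.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw
"""

for drive_id, cache in drive_cache_modes(cmdline).items():
    print(drive_id, cache)
```

Drives reported with a cache mode other than "none" are the ones to look
at; read-only CD-ROM images carry no cache option here and show up as None.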

In the qemu log file:

====
2016-10-05 16:11:19.948+0000: starting up
LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin QEMU_AUDIO_DRV=spice
/usr/libexec/qemu-kvm -name Windows-Server-2012-R2 -S -M rhel6.6.0 -cpu
SandyBridge,+erms,+smep,+fsgsbase,+pdpe1gb,+rdrand,+f16c,+osxsave,+dca,+pcid,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
-enable-kvm -m 16384 -realtime mlock=off -smp
4,sockets=4,cores=1,threads=1 -uuid be69b994-0f70-ccf3-2934-43eb4a4b795b
-nodefconfig -nodefaults -chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/Windows-Server-2012-R2.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc
base=localtime,driftfix=slew -no-reboot -no-shutdown -device
ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x4.0x7 -device
ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x4
-device
ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x4.0x1
-device
ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x4.0x2
-device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive
file=/shared/files/Windows_2012_R2_64-bit_eval.iso,if=none,media=cdrom,id=drive-ide0-0-0,readonly=on,format=raw
-device
ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=2
-drive
file=/shared/files/virtio-win-0.1.102.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw
-device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0
-drive
file=/dev/an-a07n01_vg0/Windows-Server-2012-R2_0,if=none,id=drive-virtio-disk0,format=raw,cache=writethrough,aio=native
-device
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-drive
file=/dev/an-a07n01_vg0/Windows-Server-2012-R2_1,if=none,id=drive-virtio-disk1,format=raw,cache=writethrough,aio=native
-device
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk1,id=virtio-disk1
-netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=27 -device
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:80:2d:e0,bus=pci.0,addr=0x3
-chardev pty,id=charserial0 -device
isa-serial,chardev=charserial0,id=serial0 -chardev
spicevmc,id=charchannel0,name=vdagent -device
virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0
-device usb-tablet,id=input0 -spice
port=5900,addr=127.0.0.1,disable-ticketing,seamless-migration=on -vga
qxl -global qxl-vga.ram_size=67108864 -global qxl-vga.vram_size=67108864
-incoming tcp:[::]:49152 -device
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -msg timestamp=on
char device redirected to /dev/pts/0
Features 0x20000250 unsupported. Allowed features: 0x71000454
qemu: warning: error while loading state for instance 0x0 of device
'0000:00:06.0/virtio-blk'
load of migration failed
2016-10-05 16:11:31.503+0000: shutting down
====

The key here was "qemu: warning: error while loading state for instance
0x0 of device '0000:00:06.0/virtio-blk'".
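
For anyone checking their own logs for the same failure, here is a minimal
Python sketch that pulls the failing device out of a qemu log. The function
name find_load_state_errors is hypothetical, and the pattern is based only
on the wording of the message quoted above, so other qemu versions may
phrase it differently.

```python
import re

# Pattern built from the message in this log; the exact wording is an assumption
# for any other qemu version.
LOAD_STATE_RE = re.compile(
    r"error while loading state for instance (0x[0-9a-fA-F]+) of device\s+'([^']+)'"
)


def find_load_state_errors(log_text):
    """Return (instance, device) pairs for each failed device-state load."""
    # Archived logs may be hard-wrapped, so collapse all whitespace before matching.
    flat = " ".join(log_text.split())
    return LOAD_STATE_RE.findall(flat)


# Excerpt from the qemu log on the destination node.
sample = """\
Features 0x20000250 unsupported. Allowed features: 0x71000454
qemu: warning: error while loading state for instance 0x0 of device
'0000:00:06.0/virtio-blk'
load of migration failed
"""

for instance, device in find_load_state_errors(sample):
    print(f"instance {instance}: {device}")
```

The PCI slot in the device name (00:06.0) lines up with 'addr=0x6' on the
virtio-blk-pci device in the command line above, which is virtio-disk0,
the guest's first disk.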

There was precious little matching this on Google. I could see no
problems with the XML definition or the backing LVs (there are two on
this VM, and both are passed up raw to the guest).

Inside the guest OS, I could see no problems. I could, as mentioned
above, boot the server on both nodes, but I could not live migrate.

I got to the point where I started throwing things against the wall out
of desperation. One of those was to try updating the virtio-block
drivers on the guest. The guest was built with 0.1.102 virtio stable
drivers, and the latest stable is now 0.1.126. So I updated the drivers
in Device Manager and voila! Migration started working.

We have many Win2012 R2 guests out in production, and many are using the
.102 drivers. So I have a feeling that it wasn't so much the upgrade
that made the difference, but instead the reinstall of the drivers.

I have no idea why this bug happened, but hopefully this might save
someone some grief in the future if they hit the same.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
