[Pacemaker] iscsi migration too slow (disk errors) what to do ...

Jelle de Jong jelledejong at powercraft.nl
Wed Jun 15 09:08:37 EDT 2011


On 14-06-11 22:35, Florian Haas wrote:
> On 06/14/2011 01:38 PM, Jelle de Jong wrote:
>> # disk erros during iscsi/drbd migration on kvm host system
>> http://paste.debian.net/119830/
> 
> You need to either use portblock (check the guide I mentioned in my 2/24
> message), or move the IP address to the end of the resource group.

I never read anything about portblock in the LINBIT guides. I saw a
glimpse of it in the video demonstration and have now found the RA:
crm ra info ocf:heartbeat:portblock
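
For what it is worth, here is how I would guess portblock gets wired
in, based only on the RA info (the IP address, port and primitive
names below are my own placeholders, not taken from the guide):

crm configure primitive p_block ocf:heartbeat:portblock \
    params ip="192.168.1.10" portno="3260" protocol="tcp" action="block"
crm configure primitive p_unblock ocf:heartbeat:portblock \
    params ip="192.168.1.10" portno="3260" protocol="tcp" action="unblock"
# group rg_iscsi p_block ip_virtual0 iscsi0_target \
#     iscsi0_lun0 iscsi0_lun1 iscsi0_lun2 p_unblock

As I understand it, the block resource firewalls the iSCSI port before
the target moves, so initiators just retry instead of getting a
connection reset, and the unblock resource opens the port again once
everything is up on the other node.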

I moved the IP address to the end of my resource group (members of a
group stop in reverse order, so the IP is now the first thing to go
down on migration):
# group rg_iscsi iscsi0_target iscsi0_lun0 iscsi0_lun1 iscsi0_lun2 ip_virtual0

# root@finley:~# crm configure show
http://paste.debian.net/119911/

Can you share the crm configure show of the DRBD cluster shown in the
video demonstration?

> Sure all your KVM block devices are using cache="none"?
file=/dev/lvm1-vol/kvm05-disk,if=none,id=drive-virtio-disk0,boot=on,format=raw,cache=none

# ps auxww | grep /usr/bin/kvm
http://paste.debian.net/119920/
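
A quick way to double-check every running guest at once, assuming the
cache= setting always appears on the kvm command line as above:

ps auxww | grep '[/]usr/bin/kvm' | grep -o 'cache=[a-z]*' | sort | uniq -c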

Please have a look at the KVM host; maybe I missed something, or you
can share some tips. They are all Debian stable KVM guests.

> You may also want to take a look at this guide:
> http://www.linbit.com/en/education/tech-guides/highly-available-virtualization-with-kvm-iscsi-pacemaker/
I have read it many times to see if I missed something...

> http://linuxconfau.blip.tv/file/4719948/
Nice presentation! Thank you for all your time and efforts! Can you
share the crm configure show of these nodes?

So now on to the new stress testing! Moving the IP address to the end
of the resource group did wonders!!

My siege and bonnie tests just kept running, and the migration was done
in maybe less than 10 seconds...
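
The load was roughly along these lines (the concurrency, duration, URL
and mount path here are placeholders, not my exact test invocation):

siege -c 25 -t 30M http://guest.example.com/
bonnie++ -d /mnt/test -u root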

I did get some damage to the shared storage...

root@hennessy:~# dmesg
[56951.585704] device-mapper: snapshots: Snapshot is marked invalid.
[56951.590679] Buffer I/O error on device dm-24, logical block 0
..
[57077.664125]  connection1:0: detected conn error (1020)

root@hennessy:~# lvscan
/dev/dm-24: read failed after 0 of 4096 at 4294901760: Input/output error
/dev/dm-24: read failed after 0 of 4096 at 4294959104: Input/output error
/dev/dm-24: read failed after 0 of 4096 at 0: Input/output error
/dev/dm-24: read failed after 0 of 4096 at 4096: Input/output error

But no further damage... It seems almost stable enough for production.

The damage was limited to the LVM snapshot of the high-I/O-load guest,
and the KVM guests kept running.
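
In case anyone hits the same thing: an invalidated snapshot cannot be
repaired, so the cleanup is just to drop it (the volume names below
are placeholders for my setup):

lvs lvm1-vol                   # the Snap% column shows how full the COW space got
lvremove lvm1-vol/kvm05-snap   # remove the invalidated snapshot

If it was simply the COW space filling up under the bonnie load,
growing the snapshot with lvextend before it fills would avoid the
invalidation.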

I am having a nice smile now :D  Any further tips are welcome!

Kind regards,

Jelle de Jong



