[Pacemaker] iscsi migration too slow (disk errors) what to do ...

Tue Jun 14 13:38:50 EDT 2011

On 14-06-11 15:48, Florian Haas wrote:
> On 2011-06-14 15:41, Jelle de Jong wrote:
>> On 14-06-11 15:22, Florian Haas wrote:
>>> On 2011-06-10 17:28, Jelle de Jong wrote:
>>>> The problem is most of my kvm guest file-systems get corrupted when
>>>> migrating my iscsi target on heavy disk load on the kvm guests.
>>> Have you tried setting DefaultTime2Retain like I suggested on Feb 24?
>> # root@godfrey:~# crm configure show
>> http://paste.debian.net/119798/
> 
> DefaultTime2Retain is a parameter that is being negotiated between the
> target and the initiator, and the _minimum_ of proposed
> DefaultTime2Retain values wins. The default DefaultTime2Retain for
> open-iscsi is 0, thus if the initiator proposes 0 and the target 60, 0 wins.
> 
> You'll have to set this on the initiator and the target.
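
A minimal sketch of the "minimum wins" rule described above, just to make the arithmetic explicit. negotiate_min is a hypothetical helper name for illustration and is not part of open-iscsi or tgt; the real negotiation happens inside the iSCSI login phase.

```shell
#!/bin/sh
# Hypothetical helper: models the "minimum of the proposed values wins"
# rule for DefaultTime2Retain. Arguments: initiator value, target value.
negotiate_min() {
    initiator="$1"
    target="$2"
    if [ "$initiator" -le "$target" ]; then
        echo "$initiator"
    else
        echo "$target"
    fi
}

negotiate_min 0 60    # initiator proposes 0, target 60: prints 0
negotiate_min 60 60   # both sides set to 60: prints 60
```

So raising the value on the target alone changes nothing as long as the initiator still proposes 0.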

Florian, thank you for taking the time to help! Much appreciated!

# root@godfrey:~# tgtadm --lld iscsi --mode target --op show
# root@hennessy:~# iscsiadm --mode node --targetname ... --portal ...
# root@viktoriya:~# iscsiadm -m session -P 1 --show
# root@viktoriya:~# iscsiadm --mode node --targetname ... --portal ...
# root@hennessy:~# cat /etc/iscsi/iscsid.conf
http://paste.debian.net/119805/

# found:
node.session.iscsi.DefaultTime2Retain = 0
node.session.iscsi.DefaultTime2Wait = 2

# stopped open-iscsi, added the following, then started it again:
echo 'node.session.iscsi.DefaultTime2Retain = 60' | tee --append /etc/iscsi/iscsid.conf
echo 'node.session.iscsi.DefaultTime2Wait = 5' | tee --append /etc/iscsi/iscsid.conf
rm -rv /etc/iscsi/nodes/*
# reconnected to target, restarted open-iscsi
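
Appending with tee leaves two entries for a key if the file already contains one. A possible alternative, sketched under assumptions: set_conf_param is a hypothetical helper (not an open-iscsi tool) that rewrites an existing "key = value" line or appends it once, and it relies on GNU sed for -i.

```shell
#!/bin/sh
# Hypothetical helper: make sure "key = value" appears exactly once in a
# config file. Assumes GNU sed (-i) and simple keys/values; the key is
# used as a regex, so the dots in it match any character.
set_conf_param() {
    file="$1"
    key="$2"
    value="$3"
    if grep -q "^[[:space:]]*$key[[:space:]]*=" "$file"; then
        # Key already present: rewrite that line in place.
        sed -i "s|^[[:space:]]*$key[[:space:]]*=.*|$key = $value|" "$file"
    else
        # Key missing: append it.
        printf '%s = %s\n' "$key" "$value" >> "$file"
    fi
}

# Usage (run while open-iscsi is stopped):
# set_conf_param /etc/iscsi/iscsid.conf node.session.iscsi.DefaultTime2Retain 60
# set_conf_param /etc/iscsi/iscsid.conf node.session.iscsi.DefaultTime2Wait 5
```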

# iscsiadm --mode node --targetname ...
http://paste.debian.net/119807/

# found:
node.session.iscsi.DefaultTime2Retain = 60
node.session.iscsi.DefaultTime2Wait = 5
node.session.timeo.replacement_timeout = 480
node.conn[0].timeo.noop_out_interval = 15
node.conn[0].timeo.noop_out_timeout = 30
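
To check values like the ones above without reading the whole node record, something like the following sketch could be used. get_param is a hypothetical helper; it only assumes the "key = value" line format shown above, and the sample input is copied from the values found.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one exact key from
# "key = value" lines read on stdin.
get_param() {
    awk -F ' = ' -v k="$1" '$1 == k { print $2 }'
}

# Sample input taken from the values listed above.
sample='node.session.iscsi.DefaultTime2Retain = 60
node.session.iscsi.DefaultTime2Wait = 5
node.session.timeo.replacement_timeout = 480'

printf '%s\n' "$sample" | get_param node.session.iscsi.DefaultTime2Retain    # prints 60
printf '%s\n' "$sample" | get_param node.session.timeo.replacement_timeout  # prints 480

# Against a real node record:
# iscsiadm --mode node --targetname ... --portal ... | get_param node.session.iscsi.DefaultTime2Retain
```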

# migration test: ran crm node standby on the active target
# crm configure show
http://paste.debian.net/119832/
I already had to tune the ocf:heartbeat:iSCSILogicalUnit timeout to 80s.

# repeating error message during migration until migration completes
ERROR: Called "tgtadm --lld iscsi --op delete --mode logicalunit --tid 1 --lun=1"
ERROR: Exit code 22
ERROR: Command output: "tgtadm: this logical unit is still active"

# disk errors during iscsi/drbd migration on kvm host system
http://paste.debian.net/119830/

# lvm logical volume is damaged after this...

# the kvm guest system was running bonnie++ -d /tmp/bonnie/ -n 128
# and the guest reported disk errors and bonnie crashed
# dmesg:
http://paste.debian.net/119831/

Other kvm guest running mysql got corrupted databases.

However, the kvm guests no longer end up with read-only file systems,
and after running fsck the file-system damage was recoverable, rather
than the complete destruction seen in previous tests...

Please advise :) A migration of the iscsi/drbd target should be possible
on a busy system without damage to the guests, shouldn't it?

Thanks in advance,

Kind regards,

Jelle de Jong