[ClusterLabs] Mirrored cLVM/Xen PVM Performance question for block device

Roger Zhou zzhou at suse.com
Tue May 26 00:33:58 EDT 2020


On 5/20/20 2:50 PM, Ulrich Windl wrote:
> Hi!
> 
> I have a performance question regarding delay for reading blocks in a PV Xen VM.
> First, a little background: Originally to monitor NFS outages, I developed a tool "iotwatch" (short: IOTW) that reads the first block of a block device or file (or anything you can open() and read() with Direct I/O). The tool samples the target at a rather high rate (like every 5s), keeping statistics that are queried at a lower rate (like every 5 min).
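> 
> For illustration, a single probe is roughly what the following one-shot command does (just a sketch, not iotwatch itself; the device path is the one from the plugin output below, and the 4 KiB block size is an assumption that must match the device's alignment for Direct I/O):
> 
>   dd if=/dev/sys/var of=/dev/null bs=4096 count=1 iflag=direct
> 
> dd reports the elapsed time on stderr; iotwatch does the same kind of O_DIRECT read with its own timing and statistics.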
> 
> A wrapper around the tool is used as a monitoring plugin, and the output looks like this:
> /dev/sys/var: alpha=0.01, count=75(120/120), last=0.0011, avg=0.00423/0.00264/0.00427,
>   min=0.00052(0.00052/0.00084), max=0.02465(0.02465/0.02062), variance=0.00005(0.00003)
>   |last=0.0011;;;0 exp_avg=0.00427;;;0 emin=0.00084;;;0 emax=0.02062;;;0
>   davg=0.00264;;;0 dstd_dev=0.00617;;;0
> 
> A short explanation of what these numbers mean:
> "alpha" is the weight used for exponential averaging (e.g. for "exp_avg"). "count" is the number of samples since the last read and the number of samples in the sampling queue (e.g. 120 valid samples out of a maximum of 120). "avg" is the average, "min" is the minimum, "max" is the maximum, "variance" is what it says, and "last" is the last sampling value.
> In the text output there are three numbers instead of just one, meaning the indicated value, the average of the values within the sampling queue, and the exponentially averaged value. This is mostly for debugging. The performance data output has just one of those values, selectable via command-line option. Also, the statistics can be (in this case they are) reset after they were read, so min and max will start anew...
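> 
> (The exponential average itself is just the usual update rule, here with alpha = 0.01 as shown above:
>     exp_avg_new = alpha * sample + (1 - alpha) * exp_avg_old
> i.e. each new sample contributes 1% to the running value.)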
> 
> OK, that was a rather long story before presenting the details:
> 
> A VM has its root disk on a mirrored LV (cLVM) presented as "phy:", and inside the VM the disk is partitioned like this:
> Device     Boot  Start      End  Sectors  Size Id Type
> /dev/xvdb1 *      2048   411647   409600  200M 83 Linux
> /dev/xvdb2      411648 83886079 83474432 39.8G  5 Extended
> /dev/xvdb5      413696 83886079 83472384 39.8G 8e Linux LVM
> 
> xvdb5 is a PV for the sys VG, like this:
>    opt  sys -wi-ao----   4.00g
>    root sys -wi-ao----   8.00g
>    srv  sys -wi-ao----   4.00g
>    swap sys -wi-ao----   2.00g
>    tmp  sys -wi-ao---- 512.00m
>    var  sys -wi-ao----   6.00g
> 
> LV var is mounted on /var as ext3 (acl,user_xattr). The timing thread runs with prio -80 (nice 0) at SCHED_RR, so I guess other processes won't disturb the measurements much. I see no other threads using a real-time scheduling policy in the VM; system tasks seem to run at prio 0 with some negative nice value instead...
> (On the Xen host, corosync, DLM and OCFS2 run with prio -2.)
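> 
> (For reference, a sketch with assumed invocation details: the policy can be set with "chrt -r 80 <iotwatch command>" or, for a running process, "chrt -r -p 80 <pid>", and the real-time threads in the VM can be listed with something like
>     ps -eLo pid,tid,cls,rtprio,ni,comm | grep -vw TS
> which drops every thread in the normal time-sharing class.)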
> 
> Now the story: The performance of the root disk inside the VM (IOTW-PV) has a typical read delay of less than 2ms with peaks below 40ms (a comparable local disk on bare metal would have less than 0.2ms delay with peaks below 7ms). However, when timing the var LV (IOTW_FS), the average is below 4ms with peaks up to 80ms.
> 
> The storage system behind it is an FC-based 3PAR StoreServ with all SSDs, and the service time for reads is (according to the storage system's own performance monitor (SSMC)) significantly below 0.25ms over the same time interval.
> 
> So I wonder: How can LVM in the VM add another 40ms peak on top of the base timing? The other thing that puzzles me is this: While the timing for the root disk is basically good with very few peaks, the timing of the LV has mainly three levels: the first and most common level is good performance, the next level is about 20ms more, and the third level is peaks of another 20 or 40 ms.
> 
> Is there any explanation for this? The VM is SLES12 SP5, while the Xen Host is still SLES11 SP4.
> 
> At the moment I'm thinking about how to implement VM disks in a way that is efficient while supporting live migration of VMs.
> In the past we were using filesystem images stored in OCFS2, which itself was put in a mirrored cLVM LV. Performance was rather poor, so I skipped the OCFS2 layer and created a separate LV for each VM. Unfortunately, mirroring all VM images to different storage systems is an absolute requirement.
> 

Hi Ulrich,

In your use case, the (clustered) LVM2 mirroring layer is known to be the
performance-sensitive part. OCFS2 and the SLES12 SP5 VM should not be a
performance concern in your stack.

I do see an improvement in upgrading your host to SLES12 SP5 if possible. Then
you can move from clustered LVM2 mirroring to Cluster MD RAID1, which is
intended to resolve the LVM2 mirroring performance concern. You can follow the
migration doc below; it should apply to SLES12 SP5 too.

https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-clvm.html#sec-ha-clvm-migrate
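
As a rough sketch of where that migration ends up (the device paths and the md
name below are placeholders, DLM must already be running cluster-wide, and the
clustered bitmap needs a recent enough kernel and mdadm; the linked chapter
has the authoritative steps):

  mdadm --create /dev/md0 --bitmap=clustered --metadata=1.2 \
        --level=mirror --raid-devices=2 \
        /dev/<LUN-on-storage-A> /dev/<LUN-on-storage-B>

The resulting /dev/md0 then takes the place of the mirrored LV, e.g. as a PV
for a non-mirrored VG or directly as the "phy:" device, and the array is
typically managed by a cloned resource agent in the cluster.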

Cheers,
Roger



> I'd be glad to get some insights.
> 
> Regards,
> Ulrich
> 

