[ClusterLabs] Antw: Re: Antw: Re: principal questions to a two-node cluster

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Thu Apr 23 02:55:48 EDT 2015

>>> "Lentes, Bernd" <bernd.lentes at helmholtz-muenchen.de> wrote on 22.04.2015
16:17 in message
<15785B7E063D464C86DD482FCAE4EBA501CAC3CA528C at XCH11.scidom.de>:

> Hi,
> ok. I've been asked for more precise information. Here it is:
> I'd like to create a two-node cluster with a SAN. I have two nodes (HP 
> ProLiant with SLES 11 SP3 64bit) and an HP FC SAN. My resources are some 
> VMs. Also some databases and web applications. Maybe I will install the 

Hi Bernd,

we have a similar scenario, so maybe it helps to read this summary. We
started running Xen paravirtualized VMs on SLES10 without a cluster. At that time
every VM had its disk as an HP EVA virtual disk presented via iSCSI. Inside each
VM's disk we were using LVM.

When switching to SLES11 and a cluster we thought (with experience from HP-UX
ServiceGuard and its LVM) that we could easily use mirrored LVM to provide a
logical volume for an OCFS2 filesystem that would host the VM images (we had
two SAN storage systems at two locations for disaster tolerance, and all data
should be written to both systems). We learned that (at the time of SP1) LVM
mirroring needs an extra disk to remember its mirroring state (while HP's LVM
had it on each disk), and that MD-RAID couldn't be used if the RAID was active
on multiple nodes.

So we did something completely different: We had one VDisk from each EVA for
every VM, and we presented these VDisks to a virtual WWN (NPIV with Brocade
HBA). That required some extra code as Brocade's HBAs weren't supported by Xen
for NPIV at that time. So to start a VM, we added the virtual WWN for the VM to
each HBA (actually two WWNs per machine), which triggered the creation of a
virtual SCSI HBA that showed the disks from the EVAs. Multipath bundled the paths
into one device. Then we booted the Xen VM from one of these disks. All mirroring
was done inside the VM using MD-RAID. The VGs for each VM had to have different
names as they are visible in the Xen host. When we got a newer storage system,
we found out that the timing after adding a virtual WWN was quite different,
and sometimes the disks were still not present when multipath checked for them.
(When trying to remove the virtual WWN we found that multipath would prevent it if
the map was still active, so we had an RA for handling the multipath maps.)
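
A minimal sketch of how such a timing problem can be handled in general: poll
until all expected devices exist, with a timeout, before letting the next step
run. This is only an illustration, not the RA or the extra code mentioned
above. The function name wait_for_paths, its parameters and the device path in
the usage note are assumptions made for this example.

```python
import time


def wait_for_paths(paths, exists, timeout=30.0, interval=0.5,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll until exists(path) is true for every path or the timeout expires.

    Returns the list of paths that are still missing; an empty list
    means that all of them showed up in time.
    """
    deadline = clock() + timeout
    missing = [p for p in paths if not exists(p)]
    while missing and clock() < deadline:
        sleep(interval)
        # Re-check only the paths that were missing on the previous pass.
        missing = [p for p in missing if not exists(p)]
    return missing
```

It could be called as wait_for_paths(["/dev/disk/by-id/example-lun"],
os.path.exists) (the path is a placeholder) and the caller would only go on to
the multipath step if the result is empty.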

Eventually we switched to the new storage system with a different approach: We
used cLVM to mirror a VG that provides the LV for an OCFS2 filesystem that
contains the VM images (and virtual DVD media). Due to the problem
described above (the extra disk needed for the mirroring state) we started
without a mirror bitmap (which was a bad idea,
because after a crash, cLVM started to resync the complete 300GB LV, bringing
I/O almost to a halt). Now there was only one disk for each VM, because the
OCFS2 is already mirrored by LVM. (As of SP3 you can have the bitmap on the
mirror legs, and we added that.)
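
To get a feeling for why the missing bitmap hurts, here is a rough
back-of-the-envelope sketch. Only the 300GB LV size comes from the text above.
The throughput and the dirty fraction are assumptions chosen for illustration,
not measurements, and the function name resync_seconds is made up.

```python
def resync_seconds(lv_size_gb, throughput_mb_s, dirty_fraction=1.0):
    """Estimate how long a mirror resync takes, in seconds.

    dirty_fraction=1.0 models a mirror without a bitmap (everything is
    copied again); a smaller value models a bitmap that limits the resync
    to the regions that were being written when the node crashed.
    """
    if throughput_mb_s <= 0:
        raise ValueError("throughput_mb_s must be positive")
    if not 0.0 <= dirty_fraction <= 1.0:
        raise ValueError("dirty_fraction must be between 0 and 1")
    size_mb = lv_size_gb * 1024
    return size_mb * dirty_fraction / throughput_mb_s


# Assumed 100 MB/s resync rate; 1% dirty is an arbitrary example value.
full = resync_seconds(300, 100)
partial = resync_seconds(300, 100, dirty_fraction=0.01)
print(f"full resync: {full / 60:.0f} min, with bitmap: {partial:.0f} s")
```

Under these assumed numbers the full resync takes roughly 51 minutes, while
the bitmap-limited one takes about half a minute. Real figures depend on the
storage and on the competing I/O load.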

So our approach has the advantage of scalability: If you add another node to
the cluster, just extend cLVM and OCFS2, and you are ready to run the VMs on the
new node (simplified, of course!). And you can run VMs on any node, and you can
live migrate them. If you start with a simpler approach where your
filesystem is non-clustered, you can use MD-RAID for mirroring, and you don't need
DLM, cLVM, or OCFS2, but you can only run all your machines on one node at a time,
and you cannot do live migration. So if you want to spread your VMs later, you
have to redesign...
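
The trade-off in this paragraph can be written down as a small decision
sketch. It only restates the two variants described here in simplified form.
The function name storage_stack and the dictionary keys are invented for this
example, and real designs have more options than these two.

```python
def storage_stack(vms_on_multiple_nodes, live_migration):
    """Pick one of the two variants sketched in the text above.

    Returns a dict naming the mirroring method, the filesystem type and
    the cluster components that variant needs.
    """
    if vms_on_multiple_nodes or live_migration:
        # Shared image store: every node mounts the same filesystem.
        return {
            "mirroring": "cLVM mirror",
            "filesystem": "OCFS2 (clustered)",
            "needs": ["DLM", "cLVM", "OCFS2"],
        }
    # Simple variant: all VMs run on one node at a time.
    return {
        "mirroring": "MD-RAID",
        "filesystem": "non-clustered",
        "needs": [],
    }


print(storage_stack(vms_on_multiple_nodes=False, live_migration=False))
```

Asking for live migration later flips the result to the clustered variant,
which is the redesign mentioned above.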

With this approach where all the applications run inside VMs, you can quickly
reboot a VM (30 seconds, maybe) while rebooting an HP DL380 G7 with lots of RAM
and two CPUs takes about 5 minutes. So all you have to do is make sure that
your VM starts everything you need at boot time (using standard mechanisms).
You might even use simple check software like monit to restart failed
applications inside the VM.

> databases and applications also into vm's, I don't know currently exactly.
> I'd like to have availability, no active/active or master/slave. 
> Active/passive should be sufficient

So it seems you want the simplified approach. Of course you could have one
VDisk (and one MD-RAID) per VM, so you could spread your VMs across nodes, but
you'll have to duplicate images of virtual media (unless they are external).

> What I'd like to know:
> - If I will use logical volumes, do I need cLVM ? Or is "normal" LVM 
> sufficient ?

Depends. See above.

> - snapshots with cLVM now seem to work: 
> https://www.suse.com/releasenotes/x86_64/SLE-HA/11-SP3/#id303161 (I will use
> SLES 11 SP3)
> - I'd like to have snapshots. Freezing systems before applying software or 
> changing configuration is fine. How can I achieve it ? I don't like to use 

That depends on the approach and on the level of snapshots you want: Do you want
to have a snapshot of all VMs, of a single VM, or of a single filesystem? What do
you want to do with the snapshot?

> btrfs on production systems. Ext4 snapshots are experimental. OCFS2 has file
> snapshots. But most of you propose not to use OCFS2. cLVM offers snapshots. Any
> other idea ?

See above.

> - How is the setup with cLVM ? This is my idea:  SAN offers one or more 
> volumes to the nodes ==> disks which appear in the nodes from the SAN are 
> used as physical volumes ==> physical volume/s are combined to a volume group
> ==> logical volumes are created on top of this vg, one lv for every virtual 
> machine ==> formatted with a file system like ext3 (because each virtual 
> machine does not run concurrently on both nodes this should work) ==> each 
> virtual machine in a raw file, one vm for each lv . Does that work ?
> to run different virtual machines on different nodes concurrently (as a poor
> load balancing) ?

See above also.

> Thanks for every hint.

I know my description is just a rough sketch, but maybe it helps you to pick
the variant you feel most comfortable with...


> Bernd
> Helmholtz Zentrum München
> Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH)
> Ingolstädter Landstr. 1
> 85764 Neuherberg
> www.helmholtz-muenchen.de 
