[ClusterLabs] design of a two-node cluster

Digimer lists at alteeve.ca
Mon Dec 7 16:40:44 EST 2015

On 07/12/15 03:27 PM, Lentes, Bernd wrote:
> Digimer wrote:
>> On 07/12/15 12:35 PM, Lentes, Bernd wrote:
>>> Hi,
>>> I asked around here a while ago. Unfortunately I couldn't continue
>>> to work on my cluster, so I'm still thinking about the design.
>>> I hope you will help me again with some recommendations, because once
>>> the cluster is running, changing the design is no longer possible.
>>> These are my requirements:
>>> - all services are running inside virtual machines (KVM), mostly
>>> databases and static/dynamic webpages
>> This is fine, it's what we do with our 2-node clusters.
>>> - I have two nodes and would like to have some VMs running on node A
>>> and some on node B during normal operation, as a kind of load balancing
>> I used to do this, but I've since stopped. The reasons are:
>> 1. You need to know that one node can host all servers and still perform
>> properly. By always running on one node, you know that this is the case.
>> Further, if one node ever stops being powerful enough, you will find out
>> early and can address the issue immediately.
>> 2. If there is a problem, you can always be sure which node to terminate
>> (i.e., the node hosting all servers gets the fence delay, so the node
>> without servers will always get fenced). If you lose input power, you
>> can quickly power down the backup node to shed load, etc.
> Hi Digimer,
> thanks for your reply.
> I don't understand what you want to say in (2).

To prevent a dual fence, where both nodes fence each other when
communication between the nodes fails but the nodes are otherwise
healthy, you need to set a fence delay against one of the nodes. If the
delay is on node 1, this is what happens:

Node 1 looks up how to fence node 2, sees no delay and fences
immediately. Node 2 looks up how to fence node 1, sees a delay and
pauses. Node 2 will be dead long before the delay expires, ensuring that
node 2 always loses in such a case. If you have VMs on both nodes, then
no matter which node the delay is on, some servers will be interrupted.
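
To make the race concrete, here is a toy model in Python. It is only a
sketch of the timing argument above, not how any fence agent is
implemented; the Node class, the fence_race function, the 15 second delay
and the 1 second fence_time are all made-up values for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    # Seconds a peer must wait before it may fence THIS node (hypothetical).
    fence_delay: float = 0.0


def fence_race(a: Node, b: Node, fence_time: float = 1.0) -> Optional[str]:
    """Return the surviving node's name, or None if both get fenced.

    fence_time is an assumed, fixed duration for the fence action itself.
    """
    # A node is powered off once the delay protecting it plus the fence
    # action have elapsed.
    a_dies_at = a.fence_delay + fence_time
    b_dies_at = b.fence_delay + fence_time
    if a_dies_at == b_dies_at:
        return None  # dual fence: neither side had a head start
    return a.name if a_dies_at > b_dies_at else b.name


if __name__ == "__main__":
    node1 = Node("node1", fence_delay=15.0)  # hosts all the servers
    node2 = Node("node2")                    # backup node, no delay
    print(fence_race(node1, node2))          # node1 survives
```

With no delay on either node the function returns None, which is the dual
fence case the delay exists to avoid.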

This is just one example. The other, as I mentioned, would be a lost
power condition. Your UPSes can hold up both nodes for a period of time.
If you can shut down one node, you can extend how long the UPSes can
run. So if the power goes out for a period of time, you can immediately
power down one node (the one hosting no servers) without first
live-migrating VMs, which will make things simpler and save time.
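
As a back-of-the-envelope illustration of the UPS point, the snippet below
estimates runtime before and after shedding one node. It assumes a simple
linear model (runtime = usable energy / load), which real UPSes only
roughly follow, and the 1000 Wh and 250 W figures are invented for the
example.

```python
def runtime_minutes(usable_wh: float, load_watts: float) -> float:
    """Estimate UPS runtime in minutes under a simple linear model."""
    if load_watts <= 0:
        raise ValueError("load must be positive")
    return usable_wh / load_watts * 60.0


if __name__ == "__main__":
    usable_wh = 1000.0   # assumed usable battery energy
    node_watts = 250.0   # assumed draw per node

    both_nodes = runtime_minutes(usable_wh, 2 * node_watts)
    one_node = runtime_minutes(usable_wh, node_watts)
    print(f"both nodes up: {both_nodes:.0f} min")
    print(f"backup node off: {one_node:.0f} min")
```

Under these assumptions, powering off the idle node doubles the estimated
runtime, and it can be done at once because nothing has to be migrated
first.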

Another similar example would be a loss of cooling, where you would want
to shut down nodes to minimize how much heat is being created.

There are other examples, but I think this clarifies what I meant.

>>> - I'd like to keep the setup simple (if possible)
>> There is a minimum complexity in HA, but you can get as close as
>> possible. We've spent years trying to simplify our VM hosting clusters
>> as much as possible.
>>> - availability is important, performance not so much (webpages some
>>> hundred requests per day, databases some hundred inserts/selects per
>>> day)
>> All the more reason to consolidate all VMs on one host.
>>> - I'd like to have snapshots of the vm's
>> This is never a good idea, as you catch the state of the disk at the
>> point of the snapshot, but not RAM. Anything in buffers will be missed,
>> so you cannot rely on the snapshot images to always be consistent or
>> even functional.
>>> - live migration of the vm's should be possible
>> Easy enough.
>>> - nodes are SLES 11 SP4, VMs are Windows 7 and several Linux
>>> distributions (Ubuntu, SLES, openSUSE)
>> The OS installed on the guest VMs should not factor. As for the node OS,
>> SUSE invests in making sure that HA works well so you should be fine.
>>> - setup should be extensible (add further vm's)
>> That is entirely a question of available hardware resources.
>>> - I have a shared storage (FC SAN)
>> Personally, I prefer DRBD (truly replicated storage), but SAN is fine.
>>> My ideas/questions:
>>> Should I install all VMs in one partition or every VM in a separate
>>> partition? The advantage of one VM per partition is that I don't need
>>> a cluster fs, right?
>> I would put each VM on a dedicated LV and not have an FS between the
>> VM and the host. The question then becomes; What is the PV? I use
>> clustered LVM to make sure all nodes are in sync, LVM-wise.
> Is this the setup you are running (without fs) ?

Yes, we use DRBD to replicate the storage and use the /dev/drbdX device
as the clustered LVM PV. We have one VG for the space (could add a new
DRBD resource later if needed...) and then create a dedicated LV per VM.
We have, as I mentioned, one small LV formatted with gfs2 where we store
the VMs' XML files (so that any change made to a VM is immediately
available to all nodes).
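
For illustration, the layout described above could be built with commands
along the lines of those generated below. This is a dry-run sketch that
only prints the commands. The device /dev/drbd0, the VG name, the VM names
and sizes, the 10G size of the shared LV and the cluster name "mycluster"
are all placeholders, and the exact options (for example the clustered
flag on vgcreate) depend on your LVM and gfs2 versions, so check the man
pages before running anything.

```python
from typing import Dict, List


def storage_layout_commands(
    pv: str,
    vg: str,
    vms: Dict[str, str],
    shared_lv: str = "shared",
    cluster: str = "mycluster",
) -> List[List[str]]:
    """Build (but do not run) the commands for one PV, one VG, one LV per VM."""
    cmds = [
        ["pvcreate", pv],                          # DRBD device becomes the PV
        ["vgcreate", "--clustered", "y", vg, pv],  # one clustered VG on top
    ]
    for name, size in vms.items():
        # One dedicated LV per VM, handed to the guest with no FS in between.
        cmds.append(["lvcreate", "-L", size, "-n", name, vg])
    # Small shared LV for the VM XML files, formatted with gfs2 (2 journals).
    cmds.append(["lvcreate", "-L", "10G", "-n", shared_lv, vg])
    cmds.append([
        "mkfs.gfs2", "-p", "lock_dlm", "-t", f"{cluster}:{shared_lv}",
        "-j", "2", f"/dev/{vg}/{shared_lv}",
    ])
    return cmds


if __name__ == "__main__":
    plan = storage_layout_commands("/dev/drbd0", "vg0", {"vm01": "50G"})
    for cmd in plan:
        print(" ".join(cmd))
```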

>>> I read to avoid a cluster fs if possible because it adds further
>>> complexity. Below the fs I'd like to have logical volumes because they
>>> are easy to expand.
>> Avoiding clustered FS is always preferable, yes. I use a small gfs2
>> partition, but this is just for storing VM XML data, install media, etc.
>> Things that change rarely. Some advocate for having independent FSes
>> on each node and keeping the data in sync using things like rsync or
>> what have you.
>>> Do I need cLVM (I think so) ? Is it an advantage to install the vm's
>>> in plain partitions, without a fs ?
>> I advise it, yes.
>>> It would reduce the complexity further because I don't need a fs.
>>> Would live migration still be possible ?
>> Live migration is possible provided both nodes can see the same physical
>> storage at the same time. For example, DRBD dual-primary works. If you
>> use clustered LVM, you can be sure that the backing LVs are the same
>> across the nodes.
> And this works without a cluster fs? But when both nodes access the LV
> concurrently (during the migration), will the data not be destroyed?
> cLVM does not control concurrent access, it just cares about propagating
> the lvm metadata to all nodes and locking during changes of the metadata.

The data access is managed by the live migration process, so it's
managed properly. All the LV does is create a container for the VM's
data. Clustered LVM doesn't prevent concurrent access, so if you do
something silly like create an ext4 FS on a clustered LV and try to
mount it in two places, you can. All clustered LVM does is make sure
that, as LVM things change, all nodes know about the changes immediately.

>>> snapshots:
>>> I was playing around with virsh (libvirt) to create snapshots of the
>>> VMs.
>>> In the end I gave up. virsh explains commands in its help, but when
>>> you want to use them you get messages like "not supported yet",
>>> although I use libvirt 1.2.11. This is ridiculous. I think I will
>>> create my snapshots inside the vm's using lvm.
>>> We have a network-based backup solution (Legato/EMC) which saves the
>>> disks every night.
>>> If I supply a snapshot to it, I have a consistent backup. The
>>> databases are dumped with their respective tools.
>>> Thanks in advance.
>> I don't recommend snapshots, as I mentioned. What I recommend is to
>> focus on your backup application and, if you want to minimize the time
>> to recovery after a total VM loss, create DR VMs.
> What do you mean by DR?

Disaster recovery. Create a matching VM somewhere else that can run the
same services and back up to it periodically. Then, if the primary site
is destroyed, the DR site / VMs are ready to go. This is NOT backup
though; the DR VMs are sync'ed and should always match production (as of
the last sync time). Backups are still needed to do things like recover
accidentally deleted or changed files.

> Bernd 
> Helmholtz Zentrum Muenchen
> Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
> Ingolstaedter Landstr. 1
> 85764 Neuherberg
> www.helmholtz-muenchen.de
> Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
> Geschaeftsfuehrer: Prof. Dr. Guenther Wess, Dr. Nikolaus Blum, Dr. Alfons Enhsen
> Registergericht: Amtsgericht Muenchen HRB 6466
> USt-IdNr: DE 129521671

Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
