[ClusterLabs] Fencing Two-Node Cluster on Two ESXi Hosts

Thu Nov 5 11:04:53 EST 2015

On 11/05/2015 02:43 AM, Gonçalo Lourenço wrote:
> Greetings, everyone!
> 
> 
> I'm having some trouble understanding how to properly set up fencing in my two-node cluster (Pacemaker + Corosync). I apologize beforehand if this exact question has been answered in the past, but I think the intricacies of my situation might be interesting enough to warrant yet another thread on this matter!

No apology needed! Fencing is both important and challenging for a
beginner, and I'd much rather see a question about fencing every week
than see someone disable it.

> My setup consists of two virtual machines (the nodes of the cluster) running on two separate VMware ESXi servers: VM 1 (CentOS 7) is running on ESXi 1; VM 2 (another CentOS 7) is on ESXi 2. I have all resources except for fencing running as intended (DRBD, a virtual IP address, and a DHCP server). I have no access to any additional computing resources, either physical or virtual. Both nodes use one NIC for DRBD and Corosync (since it's a virtual environment, I thought this would be enough) and another one used exclusively for the DHCP server.

One NIC for DRBD+corosync is fine. You can configure DRBD not to use the
full bandwidth, so corosync always has some breathing room.
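
As a rough illustration, assuming DRBD 8.4 syntax (8.3 used a "syncer"
section instead) and a placeholder resource name "r0", a fixed resync
cap might look like the fragment below. The 30M figure is only an
example value, and this setting limits background resynchronization
only, not normal replication writes.

```
# Hypothetical fragment of /etc/drbd.d/r0.res; "r0" is a placeholder.
resource r0 {
    disk {
        resync-rate 30M;   # cap resync at about 30 MiB/s (example value)
        c-plan-ahead 0;    # disable the dynamic controller so the fixed rate applies
    }
    # existing "on <hostname> { ... }" sections stay as they are
}
```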

> My idea for fencing this two-node cluster is the following:
> . Set up one VMware SOAP fencing agent on VM 1 that talks to ESXi 1. This agent would run exclusively on VM 1 and would only serve to fence VM 2;
> . Another VMware SOAP fencing agent on VM 2 that'll talk to ESXi 2. Yet again, this agent would run solely on VM 2 and would only fence VM 1.
> 
> Basically, the idea is to have them fence one another through the ESXi host they're running on.
> Is this the right way to go? If so, how should I configure the fencing resource? If not, what should I change?
> 
> Thank you for your time.
> 
> 
> Kind regards,
> Gonçalo Lourenço

I'm not familiar enough with VMware to address the specifics, but that
general design is a common approach in a two-node cluster. It's a great
first pass for fencing: if there's a problem in one VM, the other will
fence it.

However, what if the underlying host is not responsive? The other node
will attempt to fence but get a timeout. Since it can't confirm that the
lost node is really down, the cluster will refuse to run any resources
("better safe than sorry").

The usual solution is to have two levels of fencing: the first as you
suggested, then another for the underlying host in case that fails.

The underlying hosts probably have IPMI, so you could use that as a
second level without needing any new hardware. If the underlying host OS
is having trouble, the other node can contact the IPMI and power-kill it.
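
For what it's worth, here is roughly how the two levels might be set up
with pcs. Treat it as an untested sketch: the node names (node1, node2),
host names, VM names and credentials are all placeholders, and the
parameter names are the ones I'd expect from the fence_vmware_soap and
fence_ipmilan agents, so check "pcs stonith describe <agent>" on your
version. I'm also assuming each VM-level device has to talk to the ESXi
host that runs the VM it fences.

```shell
# Untested sketch. Prints the commands by default; set PCS=pcs to apply them.
# Node names, host names, VM names and credentials are all placeholders.
PCS="${PCS:-echo pcs}"

# Level 1: power off the VM via the ESXi host it runs on.
# (A self-signed ESXi certificate may also need ssl_insecure=1.)
$PCS stonith create fence_vm1 fence_vmware_soap \
    ipaddr=esxi1.example.com login=fenceuser passwd=secret ssl=1 \
    port=vm1 pcmk_host_list=node1
$PCS stonith create fence_vm2 fence_vmware_soap \
    ipaddr=esxi2.example.com login=fenceuser passwd=secret ssl=1 \
    port=vm2 pcmk_host_list=node2

# Level 2: power off the whole ESXi host through its IPMI interface.
$PCS stonith create fence_esxi1_ipmi fence_ipmilan \
    ipaddr=esxi1-ipmi.example.com login=admin passwd=secret lanplus=1 \
    pcmk_host_list=node1
$PCS stonith create fence_esxi2_ipmi fence_ipmilan \
    ipaddr=esxi2-ipmi.example.com login=admin passwd=secret lanplus=1 \
    pcmk_host_list=node2

# Try the VM-level device first, then fall back to IPMI.
$PCS stonith level add 1 node1 fence_vm1
$PCS stonith level add 2 node1 fence_esxi1_ipmi
$PCS stonith level add 1 node2 fence_vm2
$PCS stonith level add 2 node2 fence_esxi2_ipmi
```

With the levels in place, the cluster only tries the IPMI device for a
node if the VM-level device for that node fails.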

However, if IPMI shares power with the host (i.e. on-board as opposed
to a separate unit on a blade chassis), then you still have no recovery
if power fails. The most common solution is to use an intelligent power
switch, whether as the second level, or as a third level after IPMI. If
that's not an option, VM+IPMI fencing will still cover most of your
bases (especially if the physical hosts have redundant power supplies).

Be sure to use "two_node: 1" in corosync.conf (assuming you're using
corosync 2). That will allow one node to keep quorum if the other is
shut down.
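
For reference, the relevant part of corosync.conf would look something
like this (a minimal fragment; the rest of your file stays as it is):

```
quorum {
    provider: corosync_votequorum
    # Two-node mode: the surviving node keeps quorum when its peer is lost.
    # This also turns on wait_for_all by default, so both nodes must be
    # seen once at startup before the cluster becomes quorate.
    two_node: 1
}
```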