[ClusterLabs] Fencing with a 3-node (1 for quorum only) cluster

Digimer lists at alteeve.ca
Thu Aug 4 19:33:21 EDT 2016

On 04/08/16 07:21 PM, Dan Swartzendruber wrote:
> On 2016-08-04 19:03, Digimer wrote:
>> On 04/08/16 06:56 PM, Dan Swartzendruber wrote:
>>> I'm setting up an HA NFS server to serve up storage to a couple of
>>> vsphere hosts.  I have a virtual IP, and it depends on a ZFS resource
>>> agent which imports or exports a pool.  So far, with stonith disabled,
>>> it all works perfectly.  I was dubious about a 2-node solution, so I
>>> created a 3rd node which runs as a virtual machine on one of the hosts.
>>> All it is for is quorum.  So, looking at fencing next.  The primary
>>> server is a poweredge R905, which has DRAC for fencing.  The backup
>>> storage node is a Supermicro X9-SCL-F (with IPMI).  So I would be using
>>> the DRAC agent for the former and the ipmilan for the latter?  I was
>>> reading about location constraints, where you tell each instance of the
>>> fencing agent not to run on the node that would be getting fenced.  So,
>>> my first thought was to configure the drac agent and tell it not to
>>> fence node 1, and configure the ipmilan agent and tell it not to fence
>>> node 2.  The thing is, there is no agent available for the quorum node.
>>> Would it make more sense instead to tell the drac agent to only run on
>>> node 2, and the ipmilan agent to only run on node 1?  Thanks!
>> This is a common mistake.
>> Fencing and quorum solve different problems and are not interchangeable.
>> In short;
>> Fencing is a tool when things go wrong.
>> Quorum is a tool when things are working.
>> The only impact that having quorum has with regard to fencing is that it
>> avoids a scenario when both nodes try to fence each other and the faster
>> one wins (which is itself OK). Even then, you can add 'delay=15' to the
>> node you want to win and it will win in such a case. In the old days, it
>> would also prevent a fence loop if you started the cluster on boot and
>> comms were down. Now though, you set 'wait_for_all' and you won't get a
>> fence loop, so that solves that.
>> Said another way; Quorum is optional, fencing is not (people often get
>> that backwards).
>> As for DRAC vs IPMI, no, they are not two things. In fact, I am pretty
>> certain that fence_drac is a symlink to fence_ipmilan. All DRAC is (same
>> with iRMC, iLO, RSA, etc) is "IPMI + features". Fundamentally, the fence
>> action (rebooting the node) works via the basic IPMI standard using the
>> DRAC's BMC.
>> To do proper redundant fencing, which is a great idea, you want
>> something like switched PDUs. This is how we do it (with two node
>> clusters). IPMI first, and if that fails, a pair of PDUs (one for each
>> PSU, each PDU going to independent UPSes) as backup.
> Thanks for the quick response.  I didn't mean to give the impression
> that I didn't know the difference between quorum and fencing.  The only
> reason I (currently) have the quorum node was to prevent a deathmatch
> (which I had read about elsewhere.)  If it is as simple as adding a
> delay as you describe, I'm inclined to go that route.  At least on
> CentOS7, fence_ipmilan and fence_drac are not the same.  e.g. they are
> both python scripts that are totally different.

The delay is perfectly fine. We've shipped dozens of two-node systems
over the last five or so years, and none have had trouble. Where node
failures have occurred, fencing operated properly and services were
recovered. So in my opinion, in the interest of minimizing complexity,
I recommend the two-node approach.
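
To make the delay behaviour concrete, here is a toy model of a two-node
fence race. It is only a sketch, not how any fence agent is implemented:
the function name, the node names and the one-second base latency are
all made up for illustration. The one assumption that matters is that
the delay is set on the fence device that targets the node you want to
survive, so the shot aimed at that node lands later.

```python
def fence_race(fence_delay, base_latency=1.0):
    """Return the survivor of a simulated two-node fence race.

    fence_delay maps each node name to the delay (in seconds) configured
    on the fence device that targets that node. base_latency is a
    hypothetical time for the fence call itself to complete.
    """
    nodes = list(fence_delay)
    if len(nodes) != 2:
        raise ValueError("this toy model only covers two-node clusters")
    a, b = nodes
    # Time at which each node would be powered off by its peer.
    killed_at = {n: base_latency + fence_delay[n] for n in nodes}
    # The node that is shot first loses; its own pending fence call
    # against the peer never completes, so the peer survives.
    # (Equal delays are a coin toss in real life; here min() just
    # picks the first entry.)
    loser = min(killed_at, key=killed_at.get)
    return b if loser == a else a


if __name__ == "__main__":
    # delay=15 on the device that fences node1, so node1 wins the race.
    print(fence_race({"node1": 15, "node2": 0}))
```

With no delay on either side, the outcome depends on whichever fence
call happens to finish first, which is the race described earlier in
the thread.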

As for the two agents not being symlinked, OK. It still doesn't change
the core point, though, that both fence_ipmilan and fence_drac would be
acting on the same target.

Note: if you lose power to the mainboard (which we've seen; a failed
mainboard voltage regulator did this once), you lose the IPMI (DRAC)
BMC. Without an external secondary fence method, like switched PDUs,
this scenario will leave your cluster blocked.
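
The "IPMI first, PDUs as backup" ordering can be sketched the same way.
Again this is a rough illustration with hypothetical names (the
functions below stand in for real fence agents and do not call any
actual tool): methods are tried in order, and if none succeeds the node
cannot be confirmed off, so the cluster stays blocked.

```python
def fence_with_fallback(target, methods):
    """Try each (name, method) pair in order against target.

    Returns the name of the first method that reports success, or None
    if every method fails (the cluster would stay blocked).
    """
    for name, method in methods:
        try:
            if method(target):
                return name
        except Exception:
            # An unreachable device counts as a failed attempt.
            continue
    return None


def ipmi_with_dead_bmc(target):
    # Hypothetical stand-in: the mainboard lost power, so the BMC is gone.
    raise ConnectionError("BMC on %s is unreachable" % target)


def switched_pdu(target):
    # Hypothetical stand-in: the PDU cuts power to the node's PSUs.
    return True


if __name__ == "__main__":
    # IPMI only: nothing can confirm the node is off.
    print(fence_with_fallback("node1", [("ipmi", ipmi_with_dead_bmc)]))
    # IPMI first, PDU as backup: the second method succeeds.
    print(fence_with_fallback("node1", [("ipmi", ipmi_with_dead_bmc),
                                        ("pdu", switched_pdu)]))
```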


