[ClusterLabs] Cluster goes to unusable state if fencing resource is down

Fri Mar 18 10:21:58 EDT 2016

On 03/18/2016 02:58 AM, Arjun Pandey wrote:
> Hi
> 
> I am running a 2-node cluster with this config on CentOS 6.6, where I
> have a multi-state resource foo running in master/slave mode and a
> bunch of floating IP addresses configured. Additionally, I have a
> colocation constraint so that the IP addresses run with the
> master.
> 
> When I configure fencing using fence_ilo4 agents, things work fine.
> However, during testing I tried a case where the iLO cable is
> unplugged. In this case the entire cluster is brought down.
> 
> I understand that this seems to be a safer solution to ensure
> correctness and consistency of the systems. However my requirement was

Exactly. Without working fencing, the cluster can't know whether the
node is really down, or just malfunctioning and possibly still accessing
shared resources.

> to still keep it operational since the application and the floating IP
> are still up. Is there a way to achieve this?

If fencing fails, and the node is really down, you'd be fine ignoring
the failure. But if the node is actually up, ignoring the failure means
both nodes will activate the floating IP, which will then be unusable
(packets will sometimes go to one node, sometimes the other, disrupting
any reliable communication).

> Also, consider a case where there is a multi-node cluster (more
> than 10 nodes) and one of the machines just goes down along with the
> iLO resource for that node. Does it really make sense to bring the
> services down even when the rest of the nodes are up?

It makes sense if data integrity is your highest priority. Imagine a
cluster used by a bank for customers' account balances -- it's far
better to lock up the entire cluster than risk corrupting that data.

The best solution that pacemaker offers in this situation is fencing
topology. You can have multiple fence devices, and if one fails,
pacemaker will try the next.

One common deployment is IPMI as the first level (as you have now), with
an intelligent power switch as the second (backup) level. If IPMI
doesn't respond, the cluster will cut power to the host. Another
possibility is to use an intelligent network switch to cut off network
access to the failed node (if that is sufficient to prevent the node
from accessing any shared resources). If the services being offered are
important enough to require high availability, the relatively small cost
of an intelligent power switch should be easily justified, serving as a
type of insurance.
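
To make the fallback behaviour concrete, here is a minimal sketch in
Python of how a two-level topology behaves. It is only a simplified
model, not Pacemaker's actual code: the names fence_node,
ipmi_unreachable and power_switch are hypothetical, and each fence
device is assumed to be a function that returns True when it manages to
fence the node. Levels are tried in ascending order, and a level counts
as successful only if every device in it succeeds.

```python
from typing import Callable, Dict, List

# Hypothetical model: a fence device is a callable that takes a node
# name and returns True if it managed to fence that node.
FenceDevice = Callable[[str], bool]


def fence_node(node: str, topology: Dict[int, List[FenceDevice]]) -> bool:
    """Try each level in ascending order.

    A level succeeds only if every device in it succeeds. If a level
    fails, the next one is tried. If all levels fail, fencing fails.
    """
    for level in sorted(topology):
        if all(device(node) for device in topology[level]):
            return True
    return False


def ipmi_unreachable(node: str) -> bool:
    # Stand-in for an IPMI/iLO device whose cable has been unplugged.
    return False


def power_switch(node: str) -> bool:
    # Stand-in for an intelligent power switch that cuts power.
    return True


# Level 1 is IPMI, level 2 is the backup power switch.
topology = {1: [ipmi_unreachable], 2: [power_switch]}
fenced = fence_node("node2", topology)
```

With the unreachable IPMI device at level 1 and a working power switch
at level 2, fence_node falls through to the second level and reports
success. With only the IPMI level configured, it reports failure, which
is the situation in the original test with the cable unplugged.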

Not having fencing has such a high chance of making a huge mess that no
company I know of that supports clusters will support a cluster without it.

That said, if you are supporting your own clusters, understand the
risks, and are willing to deal with the worst-case scenario manually,
pacemaker does offer the option to disable stonith. There is no built-in
option to try stonith but ignore any failures. However, it is possible
to configure a fencing topology that does the same thing, if the second
level simply pretends that the fencing succeeded. I'm not going to
encourage that by describing how ;)