[ClusterLabs] stonith in dual HMC environment

Dejan Muhamedagic dejanmm at fastmail.fm
Tue Mar 28 16:43:18 UTC 2017


On Tue, Mar 28, 2017 at 09:54:55AM -0500, Ken Gaillot wrote:
> On 03/28/2017 08:20 AM, Alexander Markov wrote:
> > Hello, Dejan,
> > 
> >> Why? I don't have a test system right now, but for instance this
> >> should work:
> >>
> >> $ stonith -t ibmhmc ipaddr=10.1.2.9 -lS
> >> $ stonith -t ibmhmc ipaddr=10.1.2.9 -T reset {nodename}
> > 
> > Ah, I see. Everything (including stonith methods, fencing and failover)
> > works just fine under normal circumstances. Sorry if I wasn't clear
> > about that. The problem occurs only when one datacenter (i.e. one
> > IBM machine and one HMC) is lost due to a power outage.
> 
> If the datacenters are completely separate, you might want to take a
> look at booth. With booth, you set up a separate cluster at each
> datacenter, and booth coordinates which one can host resources. Each
> datacenter must have its own self-sufficient cluster with its own
> fencing, but one site does not need to be able to fence the other.
> 
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#idm139683855002656
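A minimal booth configuration for two such sites might look like the following sketch (all addresses are hypothetical, and a third location is assumed to host the arbitrator):

```
# /etc/booth/booth.conf -- sketch only; addresses are hypothetical
transport  = UDP
port       = 9929
site       = 10.1.1.1    # cluster IP at datacenter A
site       = 10.2.1.1    # cluster IP at datacenter B
arbitrator = 10.3.1.1    # third location; breaks ties if a site is lost
ticket     = "ticketA"   # only the current ticket holder may run the resources
```

Each site's cluster then ties its resources to the ticket with an rsc_ticket constraint, so whichever site is granted the ticket (with the arbitrator's vote) runs them.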
> 
> > 
> > For example:
> > test01:~ # stonith -t ibmhmc ipaddr=10.1.2.8 -lS | wc -l
> > info: ibmhmc device OK.
> > 39
> > test01:~ # stonith -t ibmhmc ipaddr=10.1.2.9 -lS | wc -l
> > info: ibmhmc device OK.
> > 39
> > 
> > As I said, the stonith device can see and manage all the cluster nodes.
> > 
> >> If so, then your configuration does not appear to be correct. If
> >> both are capable of managing all nodes then you should tell
> >> pacemaker about it.
> > 
> > Thanks for the hint. But if the stonith device returns a node list,
> > isn't it obvious to the cluster that it can manage those nodes? Could
> > you please be more precise about what you're referring to? I've
> > currently changed the configuration to two fencing levels (one per
> > HMC) but still don't think I get the idea here.
> 
> I believe Dejan is referring to fencing topology (levels).

Yes, it's just that the name escaped me at the time.  But I'm not
sure which pacemaker version is in use and whether it supports
fencing topology.

Thanks,

Dejan

> That would be
> preferable to booth if the datacenters are physically close, and even if
> one fence device fails, the other can still function.
> 
> In this case you'd probably want level 1 = the main fence device, and
> level 2 = the fence device to use if the main device fails.
> 
> A common implementation (which Digimer uses to great effect) is to use
> IPMI as level 1 and an intelligent power switch as level 2. If your
> second device can function regardless of what hosts are up or down, you
> can do something similar.
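In crmsh, the two-level scheme described above could be sketched as follows for the two HMCs in this thread (test02 is a hypothetical second node name, and this assumes a pacemaker version that supports fencing topology):

```
# crm configure -- sketch only; one stonith resource per HMC
primitive st-hmc1 stonith:ibmhmc params ipaddr=10.1.2.8
primitive st-hmc2 stonith:ibmhmc params ipaddr=10.1.2.9
# Level 1 tries the first HMC; level 2 is used only if level 1 fails
fencing_topology \
    test01: st-hmc1 st-hmc2 \
    test02: st-hmc1 st-hmc2
```

The pcs equivalent would register the levels per node with `pcs stonith level add 1 <node> <device>` and `pcs stonith level add 2 <node> <device>`.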
> 
> > 
> >>> The surviving node, running the stonith resource for the dead node,
> >>> tries to contact the IPMI device (which is also dead). How does the
> >>> cluster understand that the lost node is really dead and that it's
> >>> not just a network issue?
> >>
> >> It cannot.
> 
> And it will be unable to recover resources that were running on the
> questionable partition.
> 
> > 
> > How do people then actually solve the problem of a two-node metro
> > cluster? I mean, I know one option: stonith-enabled=false, but it
> > doesn't seem right to me.
> > 
> > Thank you.
> > 
> > Regards,
> > Alexander Markov
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



