[ClusterLabs] stonith in dual HMC environment

Dejan Muhamedagic dejanmm at fastmail.fm
Tue Mar 28 07:06:43 EDT 2017


On Mon, Mar 27, 2017 at 01:17:31PM +0300, Alexander Markov wrote:
> Hello, Dejan,
> 
> 
> >The first thing I'd try is making sure you can fence each node from the
> >command line by manually running the fence agent. I'm not sure how to do
> >that for the "stonith:" type agents.
> >
> >There's a program stonith(8). It's easy to replicate the
> >configuration on the command line.
> 
> Unfortunately, it is not.

Why? I don't have a test system right now, but for instance this
should work:

$ stonith -t ibmhmc ipaddr=10.1.2.9 -lS
$ stonith -t ibmhmc ipaddr=10.1.2.9 -T reset {nodename}

Read the examples in the man page:

$ man stonith

Check also the documentation of your agent:

$ stonith -t ibmhmc -h
$ stonith -t ibmhmc -n

> The landscape I refer to is similar to VMWare. We use the cluster for
> virtual machines (LPARs) and everything works OK, but the real pain occurs
> when the whole host system is down. Keeping in mind that it's actually in
> production use now, I just can't afford to turn it off for testing.

Yes, I understand. However, I was just talking about how to use
the stonith agents and how to do the testing outside of
pacemaker.

> >Stonith agents are to be queried for the list of nodes they can
> >manage. It's part of the interface. Some agents can figure that
> >out by themselves and some need a parameter defining the node list.
> 
> And this is just where I'm stuck. I've got two stonith devices (ibmhmc)
> for redundancy. Both of them are capable of managing every node.

If so, then your configuration does not appear to be correct. If
both devices are capable of managing all nodes, then you should
tell pacemaker about it. Digimer has fairly extensive documentation
on how to configure complex fencing setups. You can also check
your vendor's documentation.
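
For instance, with two HMCs that can each fence every node, you can
define both as stonith resources and put them into a fencing topology,
so that pacemaker tries the second device if the first one fails. A
minimal sketch in crm shell syntax (the resource names and the second
address are made up, adjust them for your setup):

$ crm configure
primitive st-hmc1 stonith:ibmhmc params ipaddr=10.1.2.9
primitive st-hmc2 stonith:ibmhmc params ipaddr=10.1.2.10
fencing_topology st-hmc1 st-hmc2
commit
quit

With no node prefix, the fencing_topology line applies to all nodes:
st-hmc1 is tried first, and st-hmc2 only if st-hmc1 fails.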

> The problem starts when
> 
> 1) one stonith device is completely lost and inaccessible (due to a power
> outage in the datacenter)
> 2) the surviving stonith device can reach neither the cluster node nor its
> hosting system (the host, in VMWare terms), because both of them were also
> lost in the power outage.

Both lost? What remained? Why do you mention VMWare? I thought
your nodes were LPARs.

> What is the correct solution for this situation?
> 
> >Well, this used to be a standard way to configure one kind of
> >stonith resource, one common representative being ipmi, and it
> >served exactly the purpose of restricting the stonith resource
> >from being enabled ("running") on the node which that resource
> >manages.
> 
> Unfortunately, there's no such thing as ipmi in IBM Power boxes.

I mentioned ipmi as an example, not that it has anything to do
with your setup.
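
Just for reference, the restriction mentioned above is usually
expressed as a location constraint, so that a stonith resource never
runs on the node it is supposed to fence. A sketch in crm shell
syntax, with hypothetical names (st-node1 being the stonith resource
responsible for fencing node1):

$ crm configure location l-st-node1 st-node1 -inf: node1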

> But it
> raises an interesting question for me: if both a node and its complementary
> ipmi device are lost (due to a power outage), what happens to the
> cluster?

The cluster gets stuck trying to fence the node. Typically this
would render your cluster unusable. There are some IPMI devices
which have a battery to allow for some extra time to manage the
host.
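
If you have verified by some other means that the node is really
powered off, you can tell the cluster so manually and let it proceed.
For example, with stonith_admin (the node name here is hypothetical);
be very careful, as confirming a node that is in fact still running
will lead to data corruption:

$ stonith_admin --confirm node1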

> The surviving node, running the stonith resource for the dead node, tries
> to contact the ipmi device (which is also dead). How does the cluster
> understand that the lost node is really dead and it's not just a network
> issue?

It cannot. That is the whole point of fencing: until the fence
operation is confirmed, the cluster has to assume the lost node may
still be running.

Thanks,

Dejan

> 
> Thank you.
> 
> -- 
> Regards,
> Alexander Markov
> +79104531955
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



