[ClusterLabs] stonith in dual HMC environment

Dejan Muhamedagic dejanmm at fastmail.fm
Mon Mar 27 05:42:13 EDT 2017


Hi,

On Fri, Mar 24, 2017 at 11:01:45AM -0500, Ken Gaillot wrote:
> On 03/22/2017 09:42 AM, Alexander Markov wrote:
> > 
> >> Please share your config along with the logs from the nodes that were
> >> affected.
> > 
> > I'm starting to think it's not about how to define stonith resources. If
> > the whole box is down with all the logical partitions defined, then the
> > HMC cannot determine whether an LPAR (partition) is really dead or just
> > inaccessible. This leads to an UNCLEAN OFFLINE node status and to
> > pacemaker refusing to do anything until it's resolved. Am I right?
> > Anyway, the simplest pacemaker config from my partitions is below.
> 
> Yes, it looks like you are correct. The fence agent is returning an
> error when pacemaker tries to use it to reboot crmapp02. From the stderr
> in the logs, the message is "ssh: connect to host 10.1.2.9 port 22: No
> route to host".
> 
> The first thing I'd try is making sure you can fence each node from the
> command line by manually running the fence agent. I'm not sure how to do
> that for the "stonith:" type agents.

There's a program stonith(8). It's easy to replicate the
configuration on the command line.
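
For example, with the ibmhmc agent and the HMC address from the
config below (a sketch; see stonith(8) for the exact options):

    # list the nodes the device claims it can manage
    stonith -t ibmhmc ipaddr=10.1.2.9 -l

    # show the device status
    stonith -t ibmhmc ipaddr=10.1.2.9 -S

    # actually fence a node (careful: this reboots the partition)
    stonith -t ibmhmc ipaddr=10.1.2.9 -T reset crmapp02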

> Once that's working, make sure the cluster can do the same, by manually
> running "stonith_admin -B $NODE" for each $NODE.
> 
> > 
> > primitive sap_ASCS SAPInstance \
> >     params InstanceName=CAP_ASCS01_crmapp \
> >     op monitor timeout=60 interval=120 depth=0
> > primitive sap_D00 SAPInstance \
> >     params InstanceName=CAP_D00_crmapp \
> >     op monitor timeout=60 interval=120 depth=0
> > primitive sap_ip IPaddr2 \
> >     params ip=10.1.12.2 nic=eth0 cidr_netmask=24
> 
> > primitive st_ch_hmc stonith:ibmhmc \
> >     params ipaddr=10.1.2.9 \
> >     op start interval=0 timeout=300
> > primitive st_hq_hmc stonith:ibmhmc \
> >     params ipaddr=10.1.2.8 \
> >     op start interval=0 timeout=300
> 
> I see you have two stonith devices defined, but they don't specify which
> nodes they can fence -- pacemaker will assume that either device can be
> used to fence either node.

Stonith agents are to be queried for the list of nodes they can
manage; it's part of the interface. Some agents can figure that
out by themselves and some need a parameter defining the node
list. This parameter is usually named hostlist, but that is not
a requirement. At any rate, the CRM should get the list of nodes
by invoking the agent, not from the resource configuration. It
is up to the stonith agent to say what it can manage.
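
A quick sketch of such a parameter, using the node names from
the config above (external/ssh is suitable for testing only):

    primitive st_ssh stonith:external/ssh \
        params hostlist="crmapp01 crmapp02"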

> > group g_sap sap_ip sap_ASCS sap_D00 \
> >     meta target-role=Started
> 
> > location l_ch_hq_hmc st_ch_hmc -inf: crmapp01
> > location l_st_hq_hmc st_hq_hmc -inf: crmapp02
> 
> These constraints restrict which node monitors which device, not which
> node the device can fence.

Well, this used to be a standard way to configure one kind of
stonith resource, a common representative being ipmi, and it
served exactly the purpose of preventing the stonith resource
from being enabled ("running") on the node which that resource
manages.
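
A minimal sketch of that pattern (made-up address and
credentials; one device per node, banned from the node it is
meant to fence):

    primitive st_ipmi_01 stonith:external/ipmi \
        params hostname=crmapp01 ipaddr=10.1.2.21 userid=admin \
            passwd=secret interface=lan
    location l_st_ipmi_01 st_ipmi_01 -inf: crmapp01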

> Assuming st_ch_hmc is intended to fence crmapp01, this will make sure
> that crmapp02 monitors that device -- but you also want something like
> pcmk_host_list=crmapp01 in the device configuration.

pcmk_host_list shouldn't be required for agents of the stonith
class.

***

There's a document describing fencing and stonith at clusterlabs.org:

http://clusterlabs.org/doc/crm_fencing.html

If it doesn't hold anymore, then something should be done about
it.

Thanks,

Dejan
