[ClusterLabs] STONITH when both IB interfaces are down, and how to trigger Filesystem mount/umount failure to test STONITH?

Marcin Dulak marcin.dulak at gmail.com
Wed Aug 19 06:31:39 EDT 2015


Hi,

I have two questions, as stated in the subject, but let me describe my
system first.

I have a Lustre-over-InfiniBand setup consisting of an MGS, an MDS, and two
OSSes, each OSS serving two OSTs, but the questions are not specific to Lustre.
Each server has two IPoIB interfaces which provide multipath redundancy to
the SAN block devices.
I'm using the crm configuration generated by the make-lustre-crm-config.py
script available at https://github.com/gc3-uzh-ch/schroedinger-lustre-ha.
With some changes (hostnames, IPs, and the fact that my setup has two IPoIB
interfaces instead of just one), the script produces the attached crm.txt.

I'm familiar with https://ourobengr.com/ha/ , which says:
"If a stop (umount of the Lustre filesystem in this case) fails,
the node will be fenced/STONITHd because this is the only safe thing to do".

I have a working STONITH, with corosync communicating over eth0 interface.
Take the example of server-02, which mounts Lustre's mdt: if I disable its
eth0 interface, server-02 is powered off and the mdt moves to server-01 as
expected.
However, if instead both IPoIB interfaces go down on server-02, the mdt is
moved to server-01 but no STONITH is performed on server-02.
This is expected, because nothing in the configuration triggers STONITH on
loss of IB connectivity.
However, if IPoIB is flapping, this setup could lead to the mdt moving
back and forth between server-01 and server-02.
Should I have STONITH shut down a node that has lost both IPoIB interfaces
(remember they are passively redundant, only one is active at a time)?
If so, how do I achieve that?
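
For completeness, this is roughly how I simulate the "both IPoIB interfaces
down" scenario when testing (just a sketch of my manual test, with the
interface names of my setup; the ethmonitor attributes should drop to 0 and
the mdt should then move to server-01):

  # on server-02: take both IPoIB interfaces down
  ip link set ib0 down
  ip link set ib1 down

  # on any node: watch the node attributes (ib0_up, ib1_up) and the resources
  crm_mon -A1

  # restore the interfaces after the test
  ip link set ib0 up
  ip link set ib1 up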

The context for the second question: the configuration contains the
following Filesystem template:

rsc_template lustre-target-template ocf:heartbeat:Filesystem \
  op monitor interval=120 timeout=60 OCF_CHECK_LEVEL=10 \
  op start   interval=0   timeout=300 on-fail=fence \
  op stop    interval=0   timeout=300 on-fail=fence

How can I make the Filesystem mount/umount fail in order to test the STONITH
action in these cases?
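
The only way I have come up with so far is untested, so treat it as a sketch
(I am not sure how the Filesystem RA's force_unmount handling behaves on a
frozen device): suspend the dm/multipath device underneath the target so that
the umount hangs, the stop operation runs into its 300s timeout, and
on-fail=fence should then kick in:

  # on server-02: freeze the mdt block device, all I/O (including umount) hangs
  dmsetup suspend lustrefs-mdt0000

  # ask pacemaker to move the resource; the stop should now time out
  crm resource move mdt server-01.domain.com

  # cleanup after the test (once server-02 is back)
  dmsetup resume lustrefs-mdt0000
  crm resource unmove mdt

A mount (start) failure could probably be provoked the same way, by suspending
the device on the target node before the start, or simply by pointing the
resource at a non-existent device. But maybe there is a more standard way to
provoke such failures?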

Extra question: where can I find the documentation/source describing what
on-fail=fence does?
And what does on-fail=stop mean in the ethmonitor template below (what exactly
is stopped?)?

rsc_template netmonitor-30sec ethmonitor \
  params repeat_count=3 repeat_interval=10 \
  op monitor interval=15s timeout=60s \
  op start   interval=0s  timeout=60s on-fail=stop \
  op stop    interval=0s on-fail=stop

Marcin
-------------- next part --------------
# each of the 4 servers has its own IPv4 address in /etc/corosync/corosync.conf:
# bindnetaddr: 192.168.1.11 server-01
# bindnetaddr: 192.168.1.12 server-02
# bindnetaddr: 192.168.1.13 server-03
# bindnetaddr: 192.168.1.14 server-04

#
# Set default resource "stickiness" to 2000,
# so that Lustre targets won't move
# to another server unless a sysadmin
# explicitly takes a move action.
# A value < 1000 would result in automatic fail-back after a failover.
#
rsc_defaults rsc-options: \
  resource-stickiness=2000


#
# Provide a template for monitoring network interfaces.
# An interface will be considered DOWN if it fails 3 checks
# separated by a 10-second interval.
#
rsc_template netmonitor-30sec ethmonitor \
  params repeat_count=3 repeat_interval=10 \
  op monitor interval=15s timeout=60s \
  op start   interval=0s  timeout=60s on-fail=stop \
  op stop    interval=0s on-fail=stop


#
# For each host, define how we STONITH that host.
#
rsc_template stonith-template stonith:fence_ipmilan \
  params \
    pcmk_host_check=static-list \
    pcmk_host_list="invalid" \
    ipaddr="invalid" \
    action=off \
    login=login passwd=passwd \
    verbose=true lanplus=true \
  op monitor interval=60s

# the ipaddr values below are those of the IPMI devices

primitive stonith-server-04 @stonith-template \
  params \
    pcmk_host_check=static-list \
    pcmk_host_list=server-04.domain.com \
    ipaddr="10.0.1.104"
    

primitive stonith-server-03 @stonith-template \
  params \
    pcmk_host_check=static-list \
    pcmk_host_list=server-03.domain.com \
    ipaddr="10.0.1.103"
    

primitive stonith-server-02 @stonith-template \
  params \
    pcmk_host_check=static-list \
    pcmk_host_list=server-02.domain.com \
    ipaddr="10.0.1.102"
    

primitive stonith-server-01 @stonith-template \
  params \
    pcmk_host_check=static-list \
    pcmk_host_list=server-01.domain.com \
    ipaddr="10.0.1.101"
    

#
# check that the `eth0` interface is up;
# it provides access to the IPMI network,
# which is used for STONITH
#
primitive ipmi_net_up @netmonitor-30sec \
   params interface=eth0 name=ipmi_net_up

clone ipmi_net_up_clone ipmi_net_up \
  meta globally-unique=true ordered=false notify=false interleave=true clone-node-max=1
    

#
# A STONITH resource can run on any node that has access to the IPMI network.
# However, avoid choosing a host as its own killer.
#


location locate-stonith-server-04 stonith-server-04 \
  rule $id=stonith-server-04-not-on-self -INFINITY: #uname eq server-04.domain.com \
  rule $id=stonith-server-04-with-ipmi   -INFINITY: ipmi_net_up eq 0
    

location locate-stonith-server-03 stonith-server-03 \
  rule $id=stonith-server-03-not-on-self -INFINITY: #uname eq server-03.domain.com \
  rule $id=stonith-server-03-with-ipmi   -INFINITY: ipmi_net_up eq 0
    

location locate-stonith-server-02 stonith-server-02 \
  rule $id=stonith-server-02-not-on-self -INFINITY: #uname eq server-02.domain.com \
  rule $id=stonith-server-02-with-ipmi   -INFINITY: ipmi_net_up eq 0
    

location locate-stonith-server-01 stonith-server-01 \
  rule $id=stonith-server-01-not-on-self -INFINITY: #uname eq server-01.domain.com \
  rule $id=stonith-server-01-with-ipmi   -INFINITY: ipmi_net_up eq 0
    

#
# check that the `ib0` interface is up
#
primitive ib0_up @netmonitor-30sec \
  params interface=ib0 name=ib0_up link_status_only=true infiniband_device=qib0 infiniband_port=1

clone ib0_up_clone ib0_up \
  meta globally-unique=true ordered=false notify=false interleave=true clone-node-max=1


#
# check that the `ib1` interface is up
#
primitive ib1_up @netmonitor-30sec \
  params interface=ib1 name=ib1_up link_status_only=true infiniband_device=qib0 infiniband_port=2

clone ib1_up_clone ib1_up \
  meta globally-unique=true ordered=false notify=false interleave=true clone-node-max=1


#
# check IB connectivity towards all other nodes
#

# host_list below contains IPoIB addresses

primitive ping ocf:pacemaker:ping \
    params name=ping dampen=5s multiplier=10 host_list="192.168.0.10 192.168.0.11 192.168.0.12 192.168.0.13 192.168.0.14 192.168.0.15 192.168.0.16 192.168.0.17" \
    op start timeout=120 on-fail=stop \
    op monitor timeout=120 interval=10 \
    op stop timeout=20 on-fail=stop

clone ping_clone ping \
    meta globally-unique=false clone-node-max=1
    

#
# The `Filesystem` RA checks that a device is readable
# and that a filesystem is mounted. We use it to manage
# the Lustre OSTs.
#
rsc_template lustre-target-template ocf:heartbeat:Filesystem \
  op monitor interval=120 timeout=60 OCF_CHECK_LEVEL=10 \
  op start   interval=0   timeout=300 on-fail=fence \
  op stop    interval=0   timeout=300 on-fail=fence


primitive lustrefs-ost0000 @lustre-target-template \
  params device="/dev/mapper/lustrefs-ost0000" directory="/mnt/lustre/lustrefs-ost0000" fstype="lustre"


primitive lustrefs-ost0001 @lustre-target-template \
  params device="/dev/mapper/lustrefs-ost0001" directory="/mnt/lustre/lustrefs-ost0001" fstype="lustre"


primitive lustrefs-ost0002 @lustre-target-template \
  params device="/dev/mapper/lustrefs-ost0002" directory="/mnt/lustre/lustrefs-ost0002" fstype="lustre"


primitive lustrefs-ost0003 @lustre-target-template \
  params device="/dev/mapper/lustrefs-ost0003" directory="/mnt/lustre/lustrefs-ost0003" fstype="lustre"


primitive mdt @lustre-target-template \
  params device="/dev/mapper/lustrefs-mdt0000" directory="/mnt/lustre/lustrefs-mdt0000" fstype="lustre"


primitive mgt @lustre-target-template \
  params device="/dev/mapper/mgt2" directory="/mnt/lustre/mgt2" fstype="lustre"


#
# Bind Lustre target locations to the hosts that can actually serve them.
#

location lustrefs-ost0000-location lustrefs-ost0000 \
  rule $id="lustrefs-ost0000_secondary_on_4" 100: #uname eq server-04.domain.com \
  rule $id="lustrefs-ost0000_primary_on_3" 1000: #uname eq server-03.domain.com \
  rule $id="lustrefs-ost0000_not_on_2" -INFINITY: #uname eq server-02.domain.com \
  rule $id="lustrefs-ost0000_not_on_1" -INFINITY: #uname eq server-01.domain.com \
  rule $id="lustrefs-ost0000_only_if_ib_up"     -INFINITY: ib0_up eq 0 and ib1_up eq 0 \
  rule $id="lustrefs-ost0000_only_if_ping_works" -INFINITY: not_defined ping or ping number:lte 0

location lustrefs-ost0001-location lustrefs-ost0001 \
  rule $id="lustrefs-ost0001_primary_on_4" 1000: #uname eq server-04.domain.com \
  rule $id="lustrefs-ost0001_secondary_on_3" 100: #uname eq server-03.domain.com \
  rule $id="lustrefs-ost0001_not_on_2" -INFINITY: #uname eq server-02.domain.com \
  rule $id="lustrefs-ost0001_not_on_1" -INFINITY: #uname eq server-01.domain.com \
  rule $id="lustrefs-ost0001_only_if_ib_up"     -INFINITY: ib0_up eq 0 and ib1_up eq 0 \
  rule $id="lustrefs-ost0001_only_if_ping_works" -INFINITY: not_defined ping or ping number:lte 0

location lustrefs-ost0002-location lustrefs-ost0002 \
  rule $id="lustrefs-ost0002_secondary_on_4" 100: #uname eq server-04.domain.com \
  rule $id="lustrefs-ost0002_primary_on_3" 1000: #uname eq server-03.domain.com \
  rule $id="lustrefs-ost0002_not_on_2" -INFINITY: #uname eq server-02.domain.com \
  rule $id="lustrefs-ost0002_not_on_1" -INFINITY: #uname eq server-01.domain.com \
  rule $id="lustrefs-ost0002_only_if_ib_up"     -INFINITY: ib0_up eq 0 and ib1_up eq 0 \
  rule $id="lustrefs-ost0002_only_if_ping_works" -INFINITY: not_defined ping or ping number:lte 0

location lustrefs-ost0003-location lustrefs-ost0003 \
  rule $id="lustrefs-ost0003_primary_on_4" 1000: #uname eq server-04.domain.com \
  rule $id="lustrefs-ost0003_secondary_on_3" 100: #uname eq server-03.domain.com \
  rule $id="lustrefs-ost0003_not_on_2" -INFINITY: #uname eq server-02.domain.com \
  rule $id="lustrefs-ost0003_not_on_1" -INFINITY: #uname eq server-01.domain.com \
  rule $id="lustrefs-ost0003_only_if_ib_up"     -INFINITY: ib0_up eq 0 and ib1_up eq 0 \
  rule $id="lustrefs-ost0003_only_if_ping_works" -INFINITY: not_defined ping or ping number:lte 0

location mdt-location mdt \
  rule $id="mdt_not_on_4" -INFINITY: #uname eq server-04.domain.com \
  rule $id="mdt_not_on_3" -INFINITY: #uname eq server-03.domain.com \
  rule $id="mdt_primary_on_2" 1000: #uname eq server-02.domain.com \
  rule $id="mdt_secondary_on_1" 100: #uname eq server-01.domain.com \
  rule $id="mdt_only_if_ib_up"     -INFINITY: ib0_up eq 0 and ib1_up eq 0 \
  rule $id="mdt_only_if_ping_works" -INFINITY: not_defined ping or ping number:lte 0

location mgt-location mgt \
  rule $id="mgt_not_on_4" -INFINITY: #uname eq server-04.domain.com \
  rule $id="mgt_not_on_3" -INFINITY: #uname eq server-03.domain.com \
  rule $id="mgt_secondary_on_2" 100: #uname eq server-02.domain.com \
  rule $id="mgt_primary_on_1" 1000: #uname eq server-01.domain.com \
  rule $id="mgt_only_if_ib_up"     -INFINITY: ib0_up eq 0 and ib1_up eq 0 \
  rule $id="mgt_only_if_ping_works" -INFINITY: not_defined ping or ping number:lte 0


#
# Set order constraints so that Lustre targets are only
# started *after* IB is up.
#

order lustrefs-ost0000-after-ib-up Mandatory: ib0_up_clone ib1_up_clone lustrefs-ost0000
order lustrefs-ost0001-after-ib-up Mandatory: ib0_up_clone ib1_up_clone lustrefs-ost0001
order lustrefs-ost0002-after-ib-up Mandatory: ib0_up_clone ib1_up_clone lustrefs-ost0002
order lustrefs-ost0003-after-ib-up Mandatory: ib0_up_clone ib1_up_clone lustrefs-ost0003
order mdt-after-ib-up Mandatory: ib0_up_clone ib1_up_clone mdt
order mgt-after-ib-up Mandatory: ib0_up_clone ib1_up_clone mgt

#
# Serialize mounting of Lustre targets,
# see: https://jira.hpdd.intel.com/browse/LU-1279
#

order serialize_targets_on_server-01-and-server-02 Serialize: mgt mdt symmetrical=false
order serialize_targets_on_server-03-and-server-04 Serialize: lustrefs-ost0003 lustrefs-ost0002 lustrefs-ost0001 lustrefs-ost0000 symmetrical=false

order mdt_after_mgt Optional: mgt mdt

order lustrefs-ost0000_after_mdt Optional: mdt lustrefs-ost0000
order lustrefs-ost0001_after_mdt Optional: mdt lustrefs-ost0001
order lustrefs-ost0002_after_mdt Optional: mdt lustrefs-ost0002
order lustrefs-ost0003_after_mdt Optional: mdt lustrefs-ost0003

property cib-bootstrap-options: \
  stonith-enabled=true \
  stonith-action=poweroff \
  maintenance-mode=false


