[ClusterLabs] Pacemaker not always selecting the right stonith device

Martin Schlegel martin at nuboreto.org
Tue Jul 19 15:33:55 UTC 2016


> Date: Tue, 19 Jul 2016 08:52:19 -0500
> From: Ken Gaillot <kgaillot at redhat.com>
> To: users at clusterlabs.org
> Subject: Re: [ClusterLabs] Pacemaker not always selecting the right
>  stonith device
> Message-ID: <bdb806b2-65d1-7fc5-4cca-6d03047d3606 at redhat.com>
> Content-Type: text/plain; charset=utf-8
> 
> On 07/18/2016 05:51 PM, Martin Schlegel wrote:
> 
> > Hello all
> > 
> > I cannot wrap my brain around what's going on here ... any help would
> > prevent me from fencing my brain =:-D
> > 
> > Problem:
> > 
> > When completely network isolating a node, e.g. pg1, sometimes a different
> > node gets fenced instead, e.g. pg3. In this case I see a syslog message
> > like this, indicating the wrong stonith device was used:
> > 
> >  stonith-ng[4650]: notice: Operation 'poweroff' [6216] (call 2 from
> > crmd.4654) for host 'pg1' with device 'p_ston_pg3' returned: 0 (OK)
> > 
> > I had assumed that only the stonith resource p_ston_pg1 had hostname=pg1
> > and was therefore the only resource eligible to be used to fence pg1!
> > 
> > Why would it use p_ston_pg3, then?
> > 
> > Configuration summary - more details and logs below:
> > 
> >  * 3x nodes pg1, pg2 and pg3
> >  * 3x stonith resources p_ston_pg1, p_ston_pg2 and p_ston_pg3 - one for
> >    each node
> >  * symmetric-cluster=false (!), please see the location constraints
> >    l_pgs_resources and l_ston_pg1, l_ston_pg2 & l_ston_pg3 further below
> >  * We rely on /etc/hosts to resolve pg1, pg2 and pg3 for corosync - the
> >    actual hostnames are completely different
> >  * We rely on the option "hostname" for stonith:external/ipmi to specify
> >    the name of the host to be managed by the defined STONITH device.
> > 
> > The stonith registration looks wrong to me - I expected a single stonith
> > device to be registered per host, since only one p_ston_pgX resource gets
> > started per host (see crm_mon output below):
> > 
> > root@test123:~# for node in pg{1..3} ; do ssh $node stonith_admin -L ; done
> > Warning: Permanently added 'pg1,10.148.128.28' (ECDSA) to the list of known
> > hosts.
> > 2 devices found
> >  p_ston_pg3
> >  p_ston_pg2
> > Warning: Permanently added 'pg2,10.148.128.7' (ECDSA) to the list of known
> > hosts.
> > 2 devices found
> >  p_ston_pg3
> >  p_ston_pg1
> > Warning: Permanently added 'pg3,10.148.128.37' (ECDSA) to the list of known
> > hosts.
> > 2 devices found
> >  p_ston_pg1
> >  p_ston_pg2
> > 
> > ... and for the host pg1 (same as for pg2 or pg3) 2x devices are found to
> > fence off pg1 - I would expect only 1 device to show up:
> > 
> > root@test123:~# for node in pg{1..3} ; do ssh $node stonith_admin -l pg1 ; done
> > 
> > Warning: Permanently added 'pg1,10.148.128.28' (ECDSA) to the list of known
> > hosts.
> > 2 devices found
> >  p_ston_pg3
> >  p_ston_pg2
> > 
> > Warning: Permanently added 'pg2,10.148.128.7' (ECDSA) to the list of known
> > hosts.
> > 2 devices found
> >  p_ston_pg1
> >  p_ston_pg3
> > 
> > Warning: Permanently added 'pg3,10.148.128.37' (ECDSA) to the list of known
> > hosts.
> > 2 devices found
> >  p_ston_pg1
> >  p_ston_pg2
> > 
> > crm_mon monitor output:
> > 
> > root@test123:~# crm_mon -1
> > Last updated: Mon Jul 18 22:45:00 2016
> > Last change: Mon Jul 18 20:52:14 2016 by root via cibadmin on pg2
> > Stack: corosync
> > Current DC: pg1 (version 1.1.14-70404b0) - partition with quorum
> > 3 nodes and 25 resources configured
> > 
> > Online: [ pg1 pg2 pg3 ]
> > 
> >  p_ston_pg1 (stonith:external/ipmi): Started pg2
> >  p_ston_pg2 (stonith:external/ipmi): Started pg3
> >  p_ston_pg3 (stonith:external/ipmi): Started pg1
> > 
> > Configuration:
> > 
> > [...]
> > 
> > primitive p_ston_pg1 stonith:external/ipmi \
> >  params hostname=pg1 ipaddr=10.148.128.35 userid=root
> > passwd="/var/vcap/data/packages/pacemaker/ra-tmp/stonith/PG1-ipmipass"
> > passwd_method=file interface=lan priv=OPERATOR
> > 
> > primitive p_ston_pg2 stonith:external/ipmi \
> >  params hostname=pg2 ipaddr=10.148.128.19 userid=root
> > passwd="/var/vcap/data/packages/pacemaker/ra-tmp/stonith/PG2-ipmipass"
> > passwd_method=file interface=lan priv=OPERATOR
> > 
> > primitive p_ston_pg3 stonith:external/ipmi \
> >  params hostname=pg3 ipaddr=10.148.128.59 userid=root
> > passwd="/var/vcap/data/packages/pacemaker/ra-tmp/stonith/PG3-ipmipass"
> > passwd_method=file interface=lan priv=OPERATOR
> > 
> > location l_pgs_resources { otherstuff p_ston_pg1 p_ston_pg2 p_ston_pg3 }
> > resource-discovery=exclusive \
> >  rule #uname eq pg1 \
> >  rule #uname eq pg2 \
> >  rule #uname eq pg3
> > 
> > location l_ston_pg1 p_ston_pg1 -inf: pg1
> > location l_ston_pg2 p_ston_pg2 -inf: pg2
> > location l_ston_pg3 p_ston_pg3 -inf: pg3
> 
> These constraints prevent each device from running on its intended
> target, but they don't limit which nodes each device can fence. For
> that, each device needs a pcmk_host_list or pcmk_host_map entry, for
> example:
> 
>  primitive p_ston_pg1 ... pcmk_host_map=pg1:pg1.ipmi.example.com
> 
> Use pcmk_host_list if the fence device needs the node name as known to
> the cluster, and pcmk_host_map if you need to translate a node name to
> an address the device understands.
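
For reference, applying that to our setup might look something like the sketch
below (crm shell syntax; the added pcmk_host_check/pcmk_host_list values are my
assumption of how the mapping would be pinned, not a tested change):

    # Tell stonith-ng explicitly which node this device is able to fence,
    # rather than relying on whatever the plugin reports dynamically.
    primitive p_ston_pg1 stonith:external/ipmi \
        params hostname=pg1 ipaddr=10.148.128.35 userid=root \
            passwd="/var/vcap/data/packages/pacemaker/ra-tmp/stonith/PG1-ipmipass" \
            passwd_method=file interface=lan priv=OPERATOR \
            pcmk_host_check=static-list pcmk_host_list=pg1
    # ... and likewise pcmk_host_list=pg2 for p_ston_pg2 and
    # pcmk_host_list=pg3 for p_ston_pg3.

With static-list, stonith-ng consults pcmk_host_list directly instead of asking
the device which hosts it can fence.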


We used the parameter "hostname". What does it do, if not that? Please see the
info on this resource below.


IPMI STONITH device (stonith:external/ipmi)

ipmitool based power management. Apparently, the power off
method of ipmitool is intercepted by ACPI, which then performs
a regular shutdown. In case of a split brain on a two-node
cluster it may happen that no node survives. For two-node
clusters use only the reset method.

Parameters (*: required, []: default):

hostname (string): Hostname
    The name of the host to be managed by this STONITH device.
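
(For reference, the excerpt above is the resource agent metadata; assuming
crmsh is available on the nodes, it can be reproduced with:

    crm ra info stonith:external/ipmi

which prints the description and every parameter the external/ipmi plugin
accepts, including "hostname".)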


> 
> > [...]
> > 
> > property cib-bootstrap-options: \
> >  have-watchdog=false \
> >  dc-version=1.1.14-70404b0 \
> >  cluster-infrastructure=corosync \
> >  symmetric-cluster=false \
> >  stonith-enabled=true \
> >  no-quorum-policy=stop \
> >  start-failure-is-fatal=false \
> >  stonith-action=poweroff \
> >  node-health-strategy=migrate-on-red \
> >  last-lrm-refresh=1468855127
> > rsc_defaults rsc-options: \
> >  resource-stickiness=INFINITY \
> >  migration-threshold=2
> > 
> > pg2's /var/log/syslog:
> > 
> > [...]
> > Jul 18 19:20:53 localhost crmd[4654]: notice: Executing poweroff fencing
> > operation (52) on pg1 (timeout=60000)
> > Jul 18 19:20:53 localhost stonith-ng[4650]: notice: Client crmd.4654.909c34cb
> > wants to fence (poweroff) 'pg1' with device '(any)'
> > Jul 18 19:20:53 localhost stonith-ng[4650]: notice: Initiating remote
> > operation poweroff for pg1: 4bc5bf9f-b180-49ad-b142-7f14f988687a (0)
> > Jul 18 19:20:53 localhost crmd[4654]: notice: Initiating action 8: start
> > p_ston_pg2_start_0 on pg3
> > Jul 18 19:20:53 localhost crmd[4654]: notice: Initiating action 10: start
> > p_ston_pg3_start_0 on pg2 (local)
> > Jul 18 19:20:55 localhost crmd[4654]: notice: Operation p_ston_pg3_start_0:
> > ok (node=pg2, call=56, rc=0, cib-update=56, confirmed=true)
> > Jul 18 19:20:58 localhost stonith-ng[4650]: notice: Operation 'poweroff'
> > [6216] (call 2 from crmd.4654) for host 'pg1' with device 'p_ston_pg3'
> > returned: 0 (OK)
> > Jul 18 19:20:58 localhost stonith-ng[4650]: notice: Operation poweroff of pg1
> > by pg2 for crmd.4654@pg2.4bc5bf9f: OK
> > Jul 18 19:20:58 localhost crmd[4654]: notice: Stonith operation
> > 2/52:0:0:577f46f1-b431-4b4d-9ed8-8a0918d791ce: OK (0)
> > Jul 18 19:20:58 localhost crmd[4654]: notice: Peer pg1 was terminated
> > (poweroff) by pg2 for pg2: OK (ref=4bc5bf9f-b180-49ad-b142-7f14f988687a) by
> > client crmd.4654
> > [...]



