[ClusterLabs] Pacemaker not always selecting the right stonith device

Mon Jul 18 22:51:40 UTC 2016

Hello all

I cannot wrap my brain around what's going on here ... any help would prevent me
from fencing my brain  =:-D

Problem:

When I completely isolate a node from the network, e.g. pg1, sometimes a
different node gets fenced instead, e.g. pg3. In that case I see a syslog
message like this, indicating that the wrong stonith device was used:
    stonith-ng[4650]:   notice: Operation 'poweroff' [6216] (call 2 from crmd.4654) for host 'pg1' with device 'p_ston_pg3' returned: 0 (OK)

I had assumed that p_ston_pg1, being the only stonith resource with
hostname=pg1, was the only resource eligible to fence pg1.

Why would it use p_ston_pg3 then?
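
To see how often this happens, the syslog can be scanned for completed fencing
operations whose device does not match the target. This is only a sketch: the
function name fence_mismatches is made up for illustration, it assumes the
p_ston_<host> naming used in this cluster, and it assumes each stonith-ng entry
sits on a single line, as it does in the actual syslog file.

```shell
# Print "host device" for every completed fencing operation in which the
# device name is not p_ston_<host> (naming convention assumed from this cluster).
# Reads syslog text on stdin.
fence_mismatches() {
    sed -n "s/.*for host '\([^']*\)' with device '\([^']*\)' returned.*/\1 \2/p" |
    while read -r host device; do
        # Keep only the entries where the device belongs to another host.
        [ "$device" = "p_ston_$host" ] || echo "$host $device"
    done
}

# Example call on a cluster node:
#   fence_mismatches < /var/log/syslog
```

Each line it prints is one fencing operation that went through a device
belonging to a different host.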

Configuration summary - more details and logs below:

  * 3x nodes: pg1, pg2 and pg3
  * 3x stonith resources: p_ston_pg1, p_ston_pg2 and p_ston_pg3 - one for each
    node
  * symmetric-cluster=false (!), please see the location constraints
    l_pgs_resources and l_ston_pg1, l_ston_pg2 & l_ston_pg3 further below
  * We rely on /etc/hosts to resolve pg1, pg2 and pg3 for corosync - the actual
    hostnames are completely different
  * We rely on the option "hostname" for stonith:external/ipmi to specify the
    name of the host to be managed by the defined STONITH device.
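
Besides the agent's own "hostname" option, Pacemaker has generic fencing
parameters, pcmk_host_list and pcmk_host_check, that state explicitly which
nodes a device may fence. Whether setting them changes the behaviour described
here is an assumption, not something confirmed in this thread. The sketch below
only prints the crm_resource calls that would pin each p_ston_<node> device to
its own node; the helper name print_host_pinning is made up for illustration,
and nothing is applied to the cluster.

```shell
# print_host_pinning NODE...
# Prints (does not run) the crm_resource calls that would restrict each
# p_ston_<node> device to fencing only <node>, via the Pacemaker parameters
# pcmk_host_list and pcmk_host_check=static-list.
print_host_pinning() {
    for node in "$@"; do
        for kv in "pcmk_host_list=$node" "pcmk_host_check=static-list"; do
            # ${kv%%=*} is the parameter name, ${kv#*=} its value.
            echo "crm_resource --resource p_ston_$node" \
                 "--set-parameter ${kv%%=*} --parameter-value ${kv#*=}"
        done
    done
}

print_host_pinning pg1 pg2 pg3
```

Printing the commands first makes it possible to review them before running
any of them by hand on a test cluster.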

The stonith registration looks wrong to me (?): I expected a single stonith
device to be registered per host, since only one p_ston_pgX resource gets
started per host (see the crm_mon output below), but every node reports two:

root@test123:~# for node in pg{1..3} ; do ssh $node stonith_admin -L ; done
Warning: Permanently added 'pg1,10.148.128.28' (ECDSA) to the list of known
hosts.
2 devices found
 p_ston_pg3
 p_ston_pg2
Warning: Permanently added 'pg2,10.148.128.7' (ECDSA) to the list of known
hosts.
2 devices found
 p_ston_pg3
 p_ston_pg1
Warning: Permanently added 'pg3,10.148.128.37' (ECDSA) to the list of known
hosts.
2 devices found
 p_ston_pg1
 p_ston_pg2

... and for the target pg1 (the same happens for pg2 and pg3), two devices are
reported as able to fence pg1, where I would expect only one to show up:

root@test123:~# for node in pg{1..3} ; do ssh $node stonith_admin -l pg1 ; done

Warning: Permanently added 'pg1,10.148.128.28' (ECDSA) to the list of known
hosts.
2 devices found
 p_ston_pg3
 p_ston_pg2

Warning: Permanently added 'pg2,10.148.128.7' (ECDSA) to the list of known
hosts.
2 devices found
 p_ston_pg1
 p_ston_pg3

Warning: Permanently added 'pg3,10.148.128.37' (ECDSA) to the list of known
hosts.
2 devices found
 p_ston_pg1
 p_ston_pg2
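
Note that in the two listings above, "stonith_admin -l pg1" returns on every
node the same set of devices as plain "stonith_admin -L". That would be
consistent with the target name not narrowing the candidate list at all,
although this is only a reading of the output, not a confirmed diagnosis. The
comparison can be scripted roughly as follows; the helper names are made up for
illustration, and the parsing assumes device names are printed on lines with a
leading space, as in the output above.

```shell
# Keep only the device names from "stonith_admin -L" / "stonith_admin -l HOST"
# output, i.e. the lines that start with a space followed by a single word.
devices_from_output() {
    sed -n 's/^ \{1,\}\([^ ]\{1,\}\) *$/\1/p'
}

# same_device_set LIST_A LIST_B
# Succeeds when both whitespace-separated lists contain the same device names,
# regardless of order.
same_device_set() {
    a=$(printf '%s\n' $1 | sort -u)   # unquoted on purpose: split into words
    b=$(printf '%s\n' $2 | sort -u)
    [ "$a" = "$b" ]
}

# Example call on a cluster node:
#   all=$(stonith_admin -L | devices_from_output)
#   for_pg1=$(stonith_admin -l pg1 | devices_from_output)
#   same_device_set "$all" "$for_pg1" && echo "pg1: list is not filtered by target"
```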

crm_mon monitor output:

root@test123:~# crm_mon -1
Last updated: Mon Jul 18 22:45:00 2016          Last change: Mon Jul 18 20:52:14 2016 by root via cibadmin on pg2
Stack: corosync
Current DC: pg1 (version 1.1.14-70404b0) - partition with quorum
3 nodes and 25 resources configured

Online: [ pg1 pg2 pg3 ]

 p_ston_pg1     (stonith:external/ipmi):        Started pg2
 p_ston_pg2     (stonith:external/ipmi):        Started pg3
 p_ston_pg3     (stonith:external/ipmi):        Started pg1

Configuration:

[...]

primitive p_ston_pg1 stonith:external/ipmi \
        params hostname=pg1 ipaddr=10.148.128.35 userid=root \
        passwd="/var/vcap/data/packages/pacemaker/ra-tmp/stonith/PG1-ipmipass" \
        passwd_method=file interface=lan priv=OPERATOR

primitive p_ston_pg2 stonith:external/ipmi \
        params hostname=pg2 ipaddr=10.148.128.19 userid=root \
        passwd="/var/vcap/data/packages/pacemaker/ra-tmp/stonith/PG2-ipmipass" \
        passwd_method=file interface=lan priv=OPERATOR

primitive p_ston_pg3 stonith:external/ipmi \
        params hostname=pg3 ipaddr=10.148.128.59 userid=root \
        passwd="/var/vcap/data/packages/pacemaker/ra-tmp/stonith/PG3-ipmipass" \
        passwd_method=file interface=lan priv=OPERATOR

location l_pgs_resources { otherstuff p_ston_pg1 p_ston_pg2 p_ston_pg3 } \
        resource-discovery=exclusive \
        rule #uname eq pg1 \
        rule #uname eq pg2 \
        rule #uname eq pg3

location l_ston_pg1 p_ston_pg1 -inf: pg1
location l_ston_pg2 p_ston_pg2 -inf: pg2
location l_ston_pg3 p_ston_pg3 -inf: pg3

[...]

property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.14-70404b0 \
        cluster-infrastructure=corosync \
        symmetric-cluster=false \
        stonith-enabled=true \
        no-quorum-policy=stop \
        start-failure-is-fatal=false \
        stonith-action=poweroff \
        node-health-strategy=migrate-on-red \
        last-lrm-refresh=1468855127
rsc_defaults rsc-options: \
        resource-stickiness=INFINITY \
        migration-threshold=2

pg2's /var/log/syslog:

[...]
Jul 18 19:20:53 localhost crmd[4654]:   notice: Executing poweroff fencing operation (52) on pg1 (timeout=60000)
Jul 18 19:20:53 localhost stonith-ng[4650]:   notice: Client crmd.4654.909c34cb wants to fence (poweroff) 'pg1' with device '(any)'
Jul 18 19:20:53 localhost stonith-ng[4650]:   notice: Initiating remote operation poweroff for pg1: 4bc5bf9f-b180-49ad-b142-7f14f988687a (0)
Jul 18 19:20:53 localhost crmd[4654]:   notice: Initiating action 8: start p_ston_pg2_start_0 on pg3
Jul 18 19:20:53 localhost crmd[4654]:   notice: Initiating action 10: start p_ston_pg3_start_0 on pg2 (local)
Jul 18 19:20:55 localhost crmd[4654]:   notice: Operation p_ston_pg3_start_0: ok (node=pg2, call=56, rc=0, cib-update=56, confirmed=true)
Jul 18 19:20:58 localhost stonith-ng[4650]:   notice: Operation 'poweroff' [6216] (call 2 from crmd.4654) for host 'pg1' with device 'p_ston_pg3' returned: 0 (OK)
Jul 18 19:20:58 localhost stonith-ng[4650]:   notice: Operation poweroff of pg1 by pg2 for crmd.4654@pg2.4bc5bf9f: OK
Jul 18 19:20:58 localhost crmd[4654]:   notice: Stonith operation 2/52:0:0:577f46f1-b431-4b4d-9ed8-8a0918d791ce: OK (0)
Jul 18 19:20:58 localhost crmd[4654]:   notice: Peer pg1 was terminated (poweroff) by pg2 for pg2: OK (ref=4bc5bf9f-b180-49ad-b142-7f14f988687a) by client crmd.4654
[...]