[Pacemaker] "stonith_admin -F node" results in a pair of reboots

Bob Haxo bhaxo at sgi.com
Tue Dec 31 19:57:57 EST 2013


Greetings ... Happy New Year!

I am testing a configuration created from the example in "Chapter 6.
Configuring a GFS2 File System in a Cluster" of the "Red Hat Enterprise
Linux 7.0 Beta Global File System 2" document.  The only addition is
stonith:fence_ipmilan.  After encountering this issue when I configured
with "crm", I re-configured using "pcs".  I've included the
configuration below.

I'm thinking that, in a 2-node cluster, if I run "stonith_admin -F
<peer-node>", then <peer-node> should reboot and cleanly rejoin the
cluster.  This is not happening.  
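
To be concrete, the test is simply, for example:

  # stonith_admin -F requests the "off" action for the peer node via
  # the configured fencing devices; -B would request a reboot instead.
  # With action=reboot set on the devices, I expect the node to return.
  stonith_admin -F mici-admin2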

What ultimately happens is that after the initially fenced node reboots,
the node from which the "stonith_admin -F" command was run is itself
fenced and reboots.  The fencing stops there, leaving the cluster in a
healthy state.

The issue seems to reside with clvmd/lvm.  When the initially fenced
node reboots and rejoins, the clvmd resource fails on the surviving
node (the 30s monitor times out; see "Failed actions" in the second
crm_mon output below).  I hypothesize there is an issue with locks, but
I have insufficient knowledge of clvmd/lvm locking to prove or disprove
this hypothesis.
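
If someone can suggest what to check, the obvious starting points on
the surviving node seem to be roughly these (standard RHEL6/cman
commands; I'm guessing at where the lock problem would surface):

  dlm_tool ls          # list DLM lockspaces (clvmd and the GFS2 mount)
  ps -ef | grep clvmd  # check whether the clvmd daemon is still alive
  vgdisplay vgha2      # may hang if clvmd cluster locking is stuck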

Have I missed something ...

1) Is this expected behavior, i.e., does the node that initiates the
fencing always end up fenced and rebooted itself?

2) Or did I fail to duplicate the Chapter 6 example correctly?

3) Or is something wrong in, or omitted from, the Chapter 6 example?

Suggestions will be much appreciated.

Thanks,
Bob Haxo

RHEL6.5
pacemaker-cli-1.1.10-14.el6_5.1.x86_64
crmsh-1.2.5-55.1sgi709r3.rhel6.x86_64
pacemaker-libs-1.1.10-14.el6_5.1.x86_64
cman-3.0.12.1-59.el6_5.1.x86_64
pacemaker-1.1.10-14.el6_5.1.x86_64
corosynclib-1.4.1-17.el6.x86_64
corosync-1.4.1-17.el6.x86_64
pacemaker-cluster-libs-1.1.10-14.el6_5.1.x86_64

The full configuration, as reported by "pcs config":

Cluster Name: mici
Corosync Nodes:

Pacemaker Nodes:
 mici-admin mici-admin2

Resources:
 Clone: clusterfs-clone
  Meta Attrs: interleave=true target-role=Started
  Resource: clusterfs (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/vgha2/lv_clust2 directory=/images fstype=gfs2 options=defaults,noatime,nodiratime
   Operations: monitor on-fail=fence interval=30s (clusterfs-monitor-interval-30s)
 Clone: clvmd-clone
  Meta Attrs: interleave=true ordered=true target-role=Started
  Resource: clvmd (class=lsb type=clvmd)
   Operations: monitor on-fail=fence interval=30s (clvmd-monitor-interval-30s)
 Clone: dlm-clone
  Meta Attrs: interleave=true ordered=true
  Resource: dlm (class=ocf provider=pacemaker type=controld)
   Operations: monitor on-fail=fence interval=30s (dlm-monitor-interval-30s)

Stonith Devices:
 Resource: p_ipmi_fencing_1 (class=stonith type=fence_ipmilan)
  Attributes: ipaddr=128.##.##.78 login=XXXXX passwd=XXXXX lanplus=1 action=reboot pcmk_host_check=static-list pcmk_host_list=mici-admin
  Meta Attrs: target-role=Started
  Operations: monitor start-delay=30 interval=60s timeout=30 (p_ipmi_fencing_1-monitor-60s)
 Resource: p_ipmi_fencing_2 (class=stonith type=fence_ipmilan)
  Attributes: ipaddr=128.##.##.220 login=XXXXX passwd=XXXXX lanplus=1 action=reboot pcmk_host_check=static-list pcmk_host_list=mici-admin2
  Meta Attrs: target-role=Started
  Operations: monitor start-delay=30 interval=60s timeout=30 (p_ipmi_fencing_2-monitor-60s)
Fencing Levels:

Location Constraints:
  Resource: p_ipmi_fencing_1
    Disabled on: mici-admin (score:-INFINITY) (id:location-p_ipmi_fencing_1-mici-admin--INFINITY)
  Resource: p_ipmi_fencing_2
    Disabled on: mici-admin2 (score:-INFINITY) (id:location-p_ipmi_fencing_2-mici-admin2--INFINITY)
Ordering Constraints:
  start dlm-clone then start clvmd-clone (Mandatory) (id:order-dlm-clone-clvmd-clone-mandatory)
  start clvmd-clone then start clusterfs-clone (Mandatory) (id:order-clvmd-clone-clusterfs-clone-mandatory)
Colocation Constraints:
  clusterfs-clone with clvmd-clone (INFINITY) (id:colocation-clusterfs-clone-clvmd-clone-INFINITY)
  clvmd-clone with dlm-clone (INFINITY) (id:colocation-clvmd-clone-dlm-clone-INFINITY)

Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.10-14.el6_5.1-368c726
 last-lrm-refresh: 1388530552
 no-quorum-policy: ignore
 stonith-enabled: true
Node Attributes:
 mici-admin: standby=off
 mici-admin2: standby=off


=====================================================
crm_mon with the cluster healthy and all resources started:

Last updated: Tue Dec 31 17:15:55 2013
Last change: Tue Dec 31 16:57:37 2013 via cibadmin on mici-admin
Stack: cman
Current DC: mici-admin2 - partition with quorum
Version: 1.1.10-14.el6_5.1-368c726
2 Nodes configured
8 Resources configured

Online: [ mici-admin mici-admin2 ]

Full list of resources:

p_ipmi_fencing_1        (stonith:fence_ipmilan):        Started mici-admin2
p_ipmi_fencing_2        (stonith:fence_ipmilan):        Started mici-admin
 Clone Set: clusterfs-clone [clusterfs]
     Started: [ mici-admin mici-admin2 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ mici-admin mici-admin2 ]
 Clone Set: dlm-clone [dlm]
     Started: [ mici-admin mici-admin2 ]

Migration summary:
* Node mici-admin:
* Node mici-admin2:

=====================================================
crm_mon after the fenced node reboots, showing the failure of clvmd
that then occurs on the surviving node, which in turn triggers the
fencing of that node:

Last updated: Tue Dec 31 17:06:55 2013
Last change: Tue Dec 31 16:57:37 2013 via cibadmin on mici-admin
Stack: cman
Current DC: mici-admin - partition with quorum
Version: 1.1.10-14.el6_5.1-368c726
2 Nodes configured
8 Resources configured

Node mici-admin: UNCLEAN (online)
Online: [ mici-admin2 ]

Full list of resources:

p_ipmi_fencing_1        (stonith:fence_ipmilan):        Stopped
p_ipmi_fencing_2        (stonith:fence_ipmilan):        Started mici-admin
 Clone Set: clusterfs-clone [clusterfs]
     Started: [ mici-admin ]
     Stopped: [ mici-admin2 ]
 Clone Set: clvmd-clone [clvmd]
     clvmd      (lsb:clvmd):    FAILED mici-admin
     Stopped: [ mici-admin2 ]
 Clone Set: dlm-clone [dlm]
     Started: [ mici-admin mici-admin2 ]

Migration summary:
* Node mici-admin:
   clvmd: migration-threshold=1000000 fail-count=1 last-failure='Tue Dec 31 17:04:29 2013'
* Node mici-admin2:

Failed actions:
    clvmd_monitor_30000 on mici-admin 'unknown error' (1): call=60, status=Timed Out, last-rc-change='Tue Dec 31 17:04:29 2013', queued=0ms, exec=0ms
