[ClusterLabs] Three VM's in cluster, running on multiple libvirt hosts, stonith not working
Steve Dainard
sdainard at spd1.com
Tue Jun 2 15:40:23 UTC 2015
Hello,
I have 3 CentOS7 guests running on 3 CentOS7 hypervisors and I can't get
stonith operations to work.
Config:
Cluster Name: nfs
Corosync Nodes:
node1 node2 node3
Pacemaker Nodes:
node1 node2 node3
Resources:
 Group: group_rbd_fs_nfs_vip
  Resource: rbd_nfs-ha (class=ocf provider=ceph type=rbd.in)
   Attributes: user=admin pool=rbd name=nfs-ha cephconf=/etc/ceph/ceph.conf
   Operations: start interval=0s timeout=20 (rbd_nfs-ha-start-timeout-20)
               stop interval=0s timeout=20 (rbd_nfs-ha-stop-timeout-20)
               monitor interval=10s timeout=20s (rbd_nfs-ha-monitor-interval-10s)
  Resource: rbd_home (class=ocf provider=ceph type=rbd.in)
   Attributes: user=admin pool=rbd name=home cephconf=/etc/ceph/ceph.conf
   Operations: start interval=0s timeout=20 (rbd_home-start-timeout-20)
               stop interval=0s timeout=20 (rbd_home-stop-timeout-20)
               monitor interval=10s timeout=20s (rbd_home-monitor-interval-10s)
  Resource: fs_nfs-ha (class=ocf provider=heartbeat type=Filesystem)
   Attributes: directory=/mnt/nfs-ha fstype=btrfs device=/dev/rbd/rbd/nfs-ha fast_stop=no
   Operations: monitor interval=20s timeout=40s (fs_nfs-ha-monitor-interval-20s)
               start interval=0 timeout=60s (fs_nfs-ha-start-interval-0)
               stop interval=0 timeout=60s (fs_nfs-ha-stop-interval-0)
  Resource: FS_home (class=ocf provider=heartbeat type=Filesystem)
   Attributes: directory=/mnt/home fstype=btrfs device=/dev/rbd/rbd/home options=rw,compress-force=lzo fast_stop=no
   Operations: monitor interval=20s timeout=40s (FS_home-monitor-interval-20s)
               start interval=0 timeout=60s (FS_home-start-interval-0)
               stop interval=0 timeout=60s (FS_home-stop-interval-0)
  Resource: nfsserver (class=ocf provider=heartbeat type=nfsserver)
   Attributes: nfs_shared_infodir=/mnt/nfs-ha
   Operations: stop interval=0s timeout=20s (nfsserver-stop-timeout-20s)
               monitor interval=10s timeout=20s (nfsserver-monitor-interval-10s)
               start interval=0 timeout=40s (nfsserver-start-interval-0)
  Resource: vip_nfs_private (class=ocf provider=heartbeat type=IPaddr)
   Attributes: ip=10.0.231.49 cidr_netmask=24
   Operations: start interval=0s timeout=20s (vip_nfs_private-start-timeout-20s)
               stop interval=0s timeout=20s (vip_nfs_private-stop-timeout-20s)
               monitor interval=5 (vip_nfs_private-monitor-interval-5)
Stonith Devices:
 Resource: NFS1 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_list=10.0.231.50 key_file=/etc/cluster/fence_xvm_ceph1.key multicast_address=225.0.0.12 port=NFS1
  Operations: monitor interval=20s (NFS1-monitor-interval-20s)
 Resource: NFS2 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_list=10.0.231.51 key_file=/etc/cluster/fence_xvm_ceph2.key multicast_address=225.0.1.12 port=NFS2
  Operations: monitor interval=20s (NFS2-monitor-interval-20s)
 Resource: NFS3 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_list=10.0.231.52 key_file=/etc/cluster/fence_xvm_ceph3.key multicast_address=225.0.2.12 port=NFS3
  Operations: monitor interval=20s (NFS3-monitor-interval-20s)
Fencing Levels:
Location Constraints:
Resource: NFS1
Enabled on: node1 (score:1) (id:location-NFS1-node1-1)
Enabled on: node2 (score:1000) (id:location-NFS1-node2-1000)
Enabled on: node3 (score:500) (id:location-NFS1-node3-500)
Resource: NFS2
Enabled on: node2 (score:1) (id:location-NFS2-node2-1)
Enabled on: node3 (score:1000) (id:location-NFS2-node3-1000)
Enabled on: node1 (score:500) (id:location-NFS2-node1-500)
Resource: NFS3
Enabled on: node3 (score:1) (id:location-NFS3-node3-1)
Enabled on: node1 (score:1000) (id:location-NFS3-node1-1000)
Enabled on: node2 (score:500) (id:location-NFS3-node2-500)
Ordering Constraints:
Colocation Constraints:
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: nfs
dc-version: 1.1.12-a14efad
have-watchdog: false
stonith-enabled: true
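With a static host list, stonith-ng only offers a device for a target whose node name appears in its pcmk_host_list; here the devices list IP addresses (10.0.231.50-52) while Pacemaker knows the nodes as node1, node2 and node3. A sketch of keying each device on the node name instead (assuming node1 is the guest backed by libvirt domain NFS1, and so on; pcmk_host_map is the standard way to map a node name to the fence_xvm port):
# pcs stonith update NFS1 pcmk_host_list="node1" pcmk_host_map="node1:NFS1"
# pcs stonith update NFS2 pcmk_host_list="node2" pcmk_host_map="node2:NFS2"
# pcs stonith update NFS3 pcmk_host_list="node3" pcmk_host_map="node3:NFS3"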
When I stop networking services on node1 (stonith resource NFS1), I see logs
on the other two cluster nodes attempting to reboot the VM NFS1, without
success.
Logs:
Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice: LogActions: Move rbd_nfs-ha (Started node1 -> node2)
Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice: LogActions: Move rbd_home (Started node1 -> node2)
Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice: LogActions: Move fs_nfs-ha (Started node1 -> node2)
Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice: LogActions: Move FS_home (Started node1 -> node2)
Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice: LogActions: Move nfsserver (Started node1 -> node2)
Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice: LogActions: Move vip_nfs_private (Started node1 -> node2)
Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: info: LogActions: Leave NFS1 (Started node2)
Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: info: LogActions: Leave NFS2 (Started node3)
Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice: LogActions: Move NFS3 (Started node1 -> node2)
Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: warning: process_pe_message: Calculated Transition 8: /var/lib/pacemaker/pengine/pe-warn-0.bz2
Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: info: do_te_invoke: Processing graph 8 (ref=pe_calc-dc-1433198297-78) derived from /var/lib/pacemaker/pengine/pe-warn-0.bz2
Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice: te_fence_node: Executing reboot fencing operation (37) on node1 (timeout=60000)
Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: notice: handle_request: Client crmd.2131.f7e79b61 wants to fence (reboot) 'node1' with device '(any)'
Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: notice: initiate_remote_stonith_op: Initiating remote operation reboot for node1: a22a16f3-b699-453e-a090-43a640dd0e3f (0)
Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: notice: can_fence_host_with_device: NFS1 can not fence (reboot) node1: static-list
Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: notice: can_fence_host_with_device: NFS2 can not fence (reboot) node1: static-list
Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: notice: can_fence_host_with_device: NFS3 can not fence (reboot) node1: static-list
Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: info: process_remote_stonith_query: All queries have arrived, continuing (2, 2, 2, a22a16f3-b699-453e-a090-43a640dd0e3f)
Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: notice: stonith_choose_peer: Couldn't find anyone to fence node1 with <any>
Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: info: call_remote_stonith: Total remote op timeout set to 60 for fencing of node node1 for crmd.2131.a22a16f3
Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: info: call_remote_stonith: None of the 2 peers have devices capable of terminating node1 for crmd.2131 (0)
Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: error: remote_op_done: Operation reboot of node1 by <no-one> for crmd.2131@node3.a22a16f3: No such device
Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice: tengine_stonith_callback: Stonith operation 2/37:8:0:241ee032-f3a1-4c2b-8427-63af83b54343: No such device (-19)
Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice: tengine_stonith_callback: Stonith operation 2 for node1 failed (No such device): aborting transition.
Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice: abort_transition_graph: Transition aborted: Stonith failed (source=tengine_stonith_callback:697, 0)
Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice: tengine_stonith_notify: Peer node1 was not terminated (reboot) by <anyone> for node3: No such device (ref=a22a16f3-b699-453e-a090-43a640dd0e3f) by client crmd.2131
Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice: run_graph: Transition 8 (Complete=1, Pending=0, Fired=0, Skipped=27, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-0.bz2): Stopped
Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice: too_many_st_failures: No devices found in cluster to fence node1, giving up
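The "static-list" messages above mean the target node name was checked against each device's pcmk_host_list and none matched. Which devices stonith-ng currently believes can fence a given node can be queried directly; a quick check (sketch, assuming stonith_admin's -l/--list option):
# stonith_admin -l node1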
I can manually fence a guest without any issue:
# fence_xvm -a 225.0.0.12 -k /etc/cluster/fence_xvm_ceph1.key -o reboot -H
NFS1
But the cluster doesn't recover resources to another host:
# pcs status    <-- after manual fencing
Cluster name: nfs
Last updated: Tue Jun 2 08:34:18 2015
Last change: Mon Jun 1 16:02:58 2015
Stack: corosync
Current DC: node3 (3) - partition with quorum
Version: 1.1.12-a14efad
3 Nodes configured
9 Resources configured
Node node1 (1): UNCLEAN (offline)
Online: [ node2 node3 ]
Full list of resources:
Resource Group: group_rbd_fs_nfs_vip
rbd_nfs-ha (ocf::ceph:rbd.in): Started node1
rbd_home (ocf::ceph:rbd.in): Started node1
fs_nfs-ha (ocf::heartbeat:Filesystem): Started node1
FS_home (ocf::heartbeat:Filesystem): Started node1
nfsserver (ocf::heartbeat:nfsserver): Started node1
vip_nfs_private (ocf::heartbeat:IPaddr): Started node1
NFS1 (stonith:fence_xvm): Started node2
NFS2 (stonith:fence_xvm): Started node3
NFS3 (stonith:fence_xvm): Started node1
PCSD Status:
node1: Online
node2: Online
node3: Online
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
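Note that a reboot issued by hand with fence_xvm happens outside Pacemaker, so the cluster never learns that node1 was actually fenced; it stays UNCLEAN and the group is not recovered. If a node really has been fenced manually (and is verifiably down), the cluster can be told so explicitly; a sketch:
# pcs stonith confirm node1
(stonith_admin --confirm node1 is the lower-level equivalent.)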
Fence_virtd config on one of the hypervisors:
# cat fence_virt.conf
backends {
        libvirt {
                uri = "qemu:///system";
        }
}

listeners {
        multicast {
                port = "1229";
                family = "ipv4";
                interface = "br1";
                address = "225.0.0.12";
                key_file = "/etc/cluster/fence_xvm_ceph1.key";
        }
}

fence_virtd {
        module_path = "/usr/lib64/fence-virt";
        backend = "libvirt";
        listener = "multicast";
}
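Each hypervisor listens on its own multicast address and key file, so every cluster node needs to be able to reach the listener that covers the guest it might have to fence. A quick end-to-end check, reusing the addresses and key files above (sketch; run from a cluster node):
# fence_xvm -o list -a 225.0.0.12 -k /etc/cluster/fence_xvm_ceph1.key
# fence_xvm -o list -a 225.0.1.12 -k /etc/cluster/fence_xvm_ceph2.key
# fence_xvm -o list -a 225.0.2.12 -k /etc/cluster/fence_xvm_ceph3.key
If a listener never answers, running it in the foreground with debugging on that hypervisor (fence_virtd -F -d99, assuming the usual foreground/debug flags) shows whether the multicast requests arrive at all.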
Thanks,
Steve