[ClusterLabs] Three VM's in cluster, running on multiple libvirt hosts, stonith not working
Ken Gaillot
kgaillot at redhat.com
Tue Jun 2 17:07:13 UTC 2015
On 06/02/2015 11:40 AM, Steve Dainard wrote:
> Hello,
>
> I have 3 CentOS7 guests running on 3 CentOS7 hypervisors and I can't get
> stonith operations to work.
>
> Config:
>
> Cluster Name: nfs
> Corosync Nodes:
> node1 node2 node3
> Pacemaker Nodes:
> node1 node2 node3
>
> Resources:
> Group: group_rbd_fs_nfs_vip
> Resource: rbd_nfs-ha (class=ocf provider=ceph type=rbd.in)
> Attributes: user=admin pool=rbd name=nfs-ha cephconf=/etc/ceph/ceph.conf
> Operations: start interval=0s timeout=20 (rbd_nfs-ha-start-timeout-20)
> stop interval=0s timeout=20 (rbd_nfs-ha-stop-timeout-20)
> monitor interval=10s timeout=20s
> (rbd_nfs-ha-monitor-interval-10s)
> Resource: rbd_home (class=ocf provider=ceph type=rbd.in)
> Attributes: user=admin pool=rbd name=home cephconf=/etc/ceph/ceph.conf
> Operations: start interval=0s timeout=20 (rbd_home-start-timeout-20)
> stop interval=0s timeout=20 (rbd_home-stop-timeout-20)
> monitor interval=10s timeout=20s
> (rbd_home-monitor-interval-10s)
> Resource: fs_nfs-ha (class=ocf provider=heartbeat type=Filesystem)
> Attributes: directory=/mnt/nfs-ha fstype=btrfs
> device=/dev/rbd/rbd/nfs-ha fast_stop=no
> Operations: monitor interval=20s timeout=40s
> (fs_nfs-ha-monitor-interval-20s)
> start interval=0 timeout=60s (fs_nfs-ha-start-interval-0)
> stop interval=0 timeout=60s (fs_nfs-ha-stop-interval-0)
> Resource: FS_home (class=ocf provider=heartbeat type=Filesystem)
> Attributes: directory=/mnt/home fstype=btrfs device=/dev/rbd/rbd/home
> options=rw,compress-force=lzo fast_stop=no
> Operations: monitor interval=20s timeout=40s
> (FS_home-monitor-interval-20s)
> start interval=0 timeout=60s (FS_home-start-interval-0)
> stop interval=0 timeout=60s (FS_home-stop-interval-0)
> Resource: nfsserver (class=ocf provider=heartbeat type=nfsserver)
> Attributes: nfs_shared_infodir=/mnt/nfs-ha
> Operations: stop interval=0s timeout=20s (nfsserver-stop-timeout-20s)
> monitor interval=10s timeout=20s
> (nfsserver-monitor-interval-10s)
> start interval=0 timeout=40s (nfsserver-start-interval-0)
> Resource: vip_nfs_private (class=ocf provider=heartbeat type=IPaddr)
> Attributes: ip=10.0.231.49 cidr_netmask=24
> Operations: start interval=0s timeout=20s
> (vip_nfs_private-start-timeout-20s)
> stop interval=0s timeout=20s
> (vip_nfs_private-stop-timeout-20s)
> monitor interval=5 (vip_nfs_private-monitor-interval-5)
>
> Stonith Devices:
> Resource: NFS1 (class=stonith type=fence_xvm)
> Attributes: pcmk_host_list=10.0.231.50
> key_file=/etc/cluster/fence_xvm_ceph1.key multicast_address=225.0.0.12
> port=NFS1
> Operations: monitor interval=20s (NFS1-monitor-interval-20s)
> Resource: NFS2 (class=stonith type=fence_xvm)
> Attributes: pcmk_host_list=10.0.231.51
> key_file=/etc/cluster/fence_xvm_ceph2.key multicast_address=225.0.1.12
> port=NFS2
> Operations: monitor interval=20s (NFS2-monitor-interval-20s)
> Resource: NFS3 (class=stonith type=fence_xvm)
> Attributes: pcmk_host_list=10.0.231.52
> key_file=/etc/cluster/fence_xvm_ceph3.key multicast_address=225.0.2.12
> port=NFS3
I think pcmk_host_list should contain the node name rather than the IP
address. If fence_xvm -o list -a <multicast address> shows the right node
names to fence, you don't even need to set pcmk_host_list.
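A quick way to check this, sketched below with the address and key taken from the NFS1 device in the config above (the `pcs stonith update` form is an assumption about how the poster would apply the fix, not something from the thread):

```shell
# From any cluster guest: ask the fence_virtd listener which domains it
# can fence, using NFS1's multicast address and key from the config above.
fence_xvm -o list -a 225.0.0.12 -k /etc/cluster/fence_xvm_ceph1.key

# If the output shows libvirt domain names (NFS1, NFS2, NFS3) rather than
# Pacemaker node names (node1, node2, node3), keep port=NFS1 but point
# pcmk_host_list at the node name Pacemaker is trying to fence:
pcs stonith update NFS1 pcmk_host_list=node1
```

With the node name in pcmk_host_list, stonith-ng's static-list check can match the fence target "node1" to the NFS1 device, while port= still tells fence_xvm which libvirt domain to act on.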
> Operations: monitor interval=20s (NFS3-monitor-interval-20s)
> Fencing Levels:
>
> Location Constraints:
> Resource: NFS1
> Enabled on: node1 (score:1) (id:location-NFS1-node1-1)
> Enabled on: node2 (score:1000) (id:location-NFS1-node2-1000)
> Enabled on: node3 (score:500) (id:location-NFS1-node3-500)
> Resource: NFS2
> Enabled on: node2 (score:1) (id:location-NFS2-node2-1)
> Enabled on: node3 (score:1000) (id:location-NFS2-node3-1000)
> Enabled on: node1 (score:500) (id:location-NFS2-node1-500)
> Resource: NFS3
> Enabled on: node3 (score:1) (id:location-NFS3-node3-1)
> Enabled on: node1 (score:1000) (id:location-NFS3-node1-1000)
> Enabled on: node2 (score:500) (id:location-NFS3-node2-500)
> Ordering Constraints:
> Colocation Constraints:
>
> Cluster Properties:
> cluster-infrastructure: corosync
> cluster-name: nfs
> dc-version: 1.1.12-a14efad
> have-watchdog: false
> stonith-enabled: true
>
> When I stop networking services on node1 (stonith resource NFS1) I see logs
> on the other two cluster nodes attempting to reboot the vm NFS1 without
> success.
>
> Logs:
>
> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice: LogActions:
> Move rbd_nfs-ha (Started node1 -> node2)
> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice: LogActions:
> Move rbd_home (Started node1 -> node2)
> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice: LogActions:
> Move fs_nfs-ha (Started node1 -> node2)
> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice: LogActions:
> Move FS_home (Started node1 -> node2)
> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice: LogActions:
> Move nfsserver (Started node1 -> node2)
> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice: LogActions:
> Move vip_nfs_private (Started node1 -> node2)
> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: info: LogActions:
> Leave NFS1 (Started node2)
> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: info: LogActions:
> Leave NFS2 (Started node3)
> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice: LogActions:
> Move NFS3 (Started node1 -> node2)
> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: warning:
> process_pe_message: Calculated Transition 8:
> /var/lib/pacemaker/pengine/pe-warn-0.bz2
> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: info:
> do_state_transition: State transition S_POLICY_ENGINE ->
> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> origin=handle_response ]
> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: info:
> do_te_invoke: Processing graph 8 (ref=pe_calc-dc-1433198297-78) derived
> from /var/lib/pacemaker/pengine/pe-warn-0.bz2
> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice:
> te_fence_node: Executing reboot fencing operation (37) on node1
> (timeout=60000)
> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: notice:
> handle_request: Client crmd.2131.f7e79b61 wants to fence (reboot) 'node1'
> with device '(any)'
> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: notice:
> initiate_remote_stonith_op: Initiating remote operation reboot for
> node1: a22a16f3-b699-453e-a090-43a640dd0e3f (0)
> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: notice:
> can_fence_host_with_device: NFS1 can not fence (reboot) node1:
> static-list
> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: notice:
> can_fence_host_with_device: NFS2 can not fence (reboot) node1:
> static-list
> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: notice:
> can_fence_host_with_device: NFS3 can not fence (reboot) node1:
> static-list
> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: info:
> process_remote_stonith_query: All queries have arrived, continuing (2,
> 2, 2, a22a16f3-b699-453e-a090-43a640dd0e3f)
> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: notice:
> stonith_choose_peer: Couldn't find anyone to fence node1 with <any>
> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: info:
> call_remote_stonith: Total remote op timeout set to 60 for fencing of
> node node1 for crmd.2131.a22a16f3
> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: info:
> call_remote_stonith: None of the 2 peers have devices capable of
> terminating node1 for crmd.2131 (0)
> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: error:
> remote_op_done: Operation reboot of node1 by <no-one> for
> crmd.2131@node3.a22a16f3: No such device
> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice:
> tengine_stonith_callback: Stonith operation
> 2/37:8:0:241ee032-f3a1-4c2b-8427-63af83b54343: No such device (-19)
> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice:
> tengine_stonith_callback: Stonith operation 2 for node1 failed (No
> such device): aborting transition.
> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice:
> abort_transition_graph: Transition aborted: Stonith failed
> (source=tengine_stonith_callback:697, 0)
> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice:
> tengine_stonith_notify: Peer node1 was not terminated (reboot) by <anyone>
> for node3: No such device (ref=a22a16f3-b699-453e-a090-43a640dd0e3f) by
> client crmd.2131
> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice: run_graph:
> Transition 8 (Complete=1, Pending=0, Fired=0, Skipped=27, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-warn-0.bz2): Stopped
> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice:
> too_many_st_failures: No devices found in cluster to fence node1, giving
> up
>
> I can manually fence a guest without any issue:
> # fence_xvm -a 225.0.0.12 -k /etc/cluster/fence_xvm_ceph1.key -o reboot -H
> NFS1
>
> But the cluster doesn't recover resources to another host:
The cluster doesn't know that the manual fencing succeeded, so it plays
it safe by not moving resources. If you fix the cluster fencing issue,
I'd expect this to work.
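If you do need to recover by hand in the meantime, you can tell the cluster the fencing happened. A sketch, reusing the manual fence_xvm command from the post (only confirm a node you are certain is really down, since the cluster will immediately recover its resources elsewhere):

```shell
# Fence the guest by hand, as in the post:
fence_xvm -a 225.0.0.12 -k /etc/cluster/fence_xvm_ceph1.key -o reboot -H NFS1

# Then acknowledge the fencing so Pacemaker marks node1 as cleanly fenced
# and moves its resources:
pcs stonith confirm node1
# equivalently: stonith_admin --confirm node1
```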
> # pcs status    <-- after manual fencing
> Cluster name: nfs
> Last updated: Tue Jun 2 08:34:18 2015
> Last change: Mon Jun 1 16:02:58 2015
> Stack: corosync
> Current DC: node3 (3) - partition with quorum
> Version: 1.1.12-a14efad
> 3 Nodes configured
> 9 Resources configured
>
>
> Node node1 (1): UNCLEAN (offline)
> Online: [ node2 node3 ]
>
> Full list of resources:
>
> Resource Group: group_rbd_fs_nfs_vip
> rbd_nfs-ha (ocf::ceph:rbd.in): Started node1
> rbd_home (ocf::ceph:rbd.in): Started node1
> fs_nfs-ha (ocf::heartbeat:Filesystem): Started node1
> FS_home (ocf::heartbeat:Filesystem): Started node1
> nfsserver (ocf::heartbeat:nfsserver): Started node1
> vip_nfs_private (ocf::heartbeat:IPaddr): Started node1
> NFS1 (stonith:fence_xvm): Started node2
> NFS2 (stonith:fence_xvm): Started node3
> NFS3 (stonith:fence_xvm): Started node1
>
> PCSD Status:
> node1: Online
> node2: Online
> node3: Online
>
> Daemon Status:
> corosync: active/disabled
> pacemaker: active/disabled
> pcsd: active/enabled
>
> Fence_virtd config on one of the hypervisors:
> # cat fence_virt.conf
> backends {
> libvirt {
> uri = "qemu:///system";
> }
>
> }
>
> listeners {
> multicast {
> port = "1229";
> family = "ipv4";
> interface = "br1";
> address = "225.0.0.12";
> key_file = "/etc/cluster/fence_xvm_ceph1.key";
> }
>
> }
>
> fence_virtd {
> module_path = "/usr/lib64/fence-virt";
> backend = "libvirt";
> listener = "multicast";
> }