<div dir="ltr"><div>Hi Ken,</div><div><br></div><div>I've tried configuring without pcmk_host_list as well with the same result.</div><div><br></div><div>Stonith Devices: </div><div> Resource: NFS1 (class=stonith type=fence_xvm)</div><div> Attributes: key_file=/etc/cluster/fence_xvm_ceph1.key multicast_address=225.0.0.12 port=NFS1 </div><div> Operations: monitor interval=20s (NFS1-monitor-interval-20s)</div><div> Resource: NFS2 (class=stonith type=fence_xvm)</div><div> Attributes: key_file=/etc/cluster/fence_xvm_ceph2.key multicast_address=225.0.1.12 port=NFS2 </div><div> Operations: monitor interval=20s (NFS2-monitor-interval-20s)</div><div> Resource: NFS3 (class=stonith type=fence_xvm)</div><div> Attributes: key_file=/etc/cluster/fence_xvm_ceph3.key multicast_address=225.0.2.12 port=NFS3 </div><div> Operations: monitor interval=20s (NFS3-monitor-interval-20s)</div><div><br></div><div>I can get the list of VM's from any of the 3 cluster nodes using the multicast address:</div><div><br></div><div><div># fence_xvm -a 225.0.0.12 -k /etc/cluster/fence_xvm_ceph1.key -o list</div><div>NFS1 1814d93d-3e40-797f-a3c6-102aaa6a3d01 on<br></div></div><div><br></div><div><div># fence_xvm -a 225.0.1.12 -k /etc/cluster/fence_xvm_ceph2.key -o list</div><div>NFS2 75ab85fc-40e9-45ae-8b0a-c346d59b24e8 on<br></div></div><div><br></div><div><div># fence_xvm -a 225.0.2.12 -k /etc/cluster/fence_xvm_ceph3.key -o list</div><div>NFS3 f23cca5d-d50b-46d2-85dd-d8357337fd22 on<br></div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jun 2, 2015 at 10:07 AM, Ken Gaillot <span dir="ltr"><<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On 06/02/2015 11:40 AM, Steve Dainard wrote:<br>
> Hello,
>
> I have 3 CentOS7 guests running on 3 CentOS7 hypervisors and I can't get
> stonith operations to work.
>
> Config:
>
> Cluster Name: nfs
> Corosync Nodes:
> node1 node2 node3
> Pacemaker Nodes:
> node1 node2 node3
>
> Resources:
> Group: group_rbd_fs_nfs_vip
> Resource: rbd_nfs-ha (class=ocf provider=ceph type=rbd.in)
> Attributes: user=admin pool=rbd name=nfs-ha cephconf=/etc/ceph/ceph.conf
> Operations: start interval=0s timeout=20 (rbd_nfs-ha-start-timeout-20)
> stop interval=0s timeout=20 (rbd_nfs-ha-stop-timeout-20)
> monitor interval=10s timeout=20s
> (rbd_nfs-ha-monitor-interval-10s)
> Resource: rbd_home (class=ocf provider=ceph type=rbd.in)
> Attributes: user=admin pool=rbd name=home cephconf=/etc/ceph/ceph.conf
> Operations: start interval=0s timeout=20 (rbd_home-start-timeout-20)
> stop interval=0s timeout=20 (rbd_home-stop-timeout-20)
> monitor interval=10s timeout=20s
> (rbd_home-monitor-interval-10s)
> Resource: fs_nfs-ha (class=ocf provider=heartbeat type=Filesystem)
> Attributes: directory=/mnt/nfs-ha fstype=btrfs
> device=/dev/rbd/rbd/nfs-ha fast_stop=no
> Operations: monitor interval=20s timeout=40s
> (fs_nfs-ha-monitor-interval-20s)
> start interval=0 timeout=60s (fs_nfs-ha-start-interval-0)
> stop interval=0 timeout=60s (fs_nfs-ha-stop-interval-0)
> Resource: FS_home (class=ocf provider=heartbeat type=Filesystem)
> Attributes: directory=/mnt/home fstype=btrfs device=/dev/rbd/rbd/home
> options=rw,compress-force=lzo fast_stop=no
> Operations: monitor interval=20s timeout=40s
> (FS_home-monitor-interval-20s)
> start interval=0 timeout=60s (FS_home-start-interval-0)
> stop interval=0 timeout=60s (FS_home-stop-interval-0)
> Resource: nfsserver (class=ocf provider=heartbeat type=nfsserver)
> Attributes: nfs_shared_infodir=/mnt/nfs-ha
> Operations: stop interval=0s timeout=20s (nfsserver-stop-timeout-20s)
> monitor interval=10s timeout=20s
> (nfsserver-monitor-interval-10s)
> start interval=0 timeout=40s (nfsserver-start-interval-0)
> Resource: vip_nfs_private (class=ocf provider=heartbeat type=IPaddr)
> Attributes: ip=10.0.231.49 cidr_netmask=24
> Operations: start interval=0s timeout=20s
> (vip_nfs_private-start-timeout-20s)
> stop interval=0s timeout=20s
> (vip_nfs_private-stop-timeout-20s)
> monitor interval=5 (vip_nfs_private-monitor-interval-5)
>
> Stonith Devices:
> Resource: NFS1 (class=stonith type=fence_xvm)
> Attributes: pcmk_host_list=10.0.231.50
> key_file=/etc/cluster/fence_xvm_ceph1.key multicast_address=225.0.0.12
> port=NFS1
> Operations: monitor interval=20s (NFS1-monitor-interval-20s)
> Resource: NFS2 (class=stonith type=fence_xvm)
> Attributes: pcmk_host_list=10.0.231.51
> key_file=/etc/cluster/fence_xvm_ceph2.key multicast_address=225.0.1.12
> port=NFS2
> Operations: monitor interval=20s (NFS2-monitor-interval-20s)
> Resource: NFS3 (class=stonith type=fence_xvm)
> Attributes: pcmk_host_list=10.0.231.52
> key_file=/etc/cluster/fence_xvm_ceph3.key multicast_address=225.0.2.12
> port=NFS3
>
I think pcmk_host_list should have the node name rather than the IP
address. If fence_xvm -o list -a whatever shows the right nodes to
fence, you don't even need to set pcmk_host_list.
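For example, a minimal sketch of that change with pcs (assuming node1 runs as
the NFS1 domain, node2 as NFS2 and node3 as NFS3 -- adjust if your mapping
differs):

# pcs stonith update NFS1 pcmk_host_list="node1"
# pcs stonith update NFS2 pcmk_host_list="node2"
# pcs stonith update NFS3 pcmk_host_list="node3"

Since the fence_xvm domains (NFS1/NFS2/NFS3) are not the same strings as the
cluster node names (node1/node2/node3), stonithd also needs the existing
port=... setting, or alternatively pcmk_host_map (e.g.
pcmk_host_map="node1:NFS1") to translate a node name to the domain to act on.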
<div><div class="h5"><br>
> Operations: monitor interval=20s (NFS3-monitor-interval-20s)
> Fencing Levels:
>
> Location Constraints:
> Resource: NFS1
> Enabled on: node1 (score:1) (id:location-NFS1-node1-1)
> Enabled on: node2 (score:1000) (id:location-NFS1-node2-1000)
> Enabled on: node3 (score:500) (id:location-NFS1-node3-500)
> Resource: NFS2
> Enabled on: node2 (score:1) (id:location-NFS2-node2-1)
> Enabled on: node3 (score:1000) (id:location-NFS2-node3-1000)
> Enabled on: node1 (score:500) (id:location-NFS2-node1-500)
> Resource: NFS3
> Enabled on: node3 (score:1) (id:location-NFS3-node3-1)
> Enabled on: node1 (score:1000) (id:location-NFS3-node1-1000)
> Enabled on: node2 (score:500) (id:location-NFS3-node2-500)
> Ordering Constraints:
> Colocation Constraints:
>
> Cluster Properties:
> cluster-infrastructure: corosync
> cluster-name: nfs
> dc-version: 1.1.12-a14efad
> have-watchdog: false
> stonith-enabled: true
>
> When I stop networking services on node1 (stonith resource NFS1) I see logs
> on the other two cluster nodes attempting to reboot the vm NFS1 without
> success.
>
> Logs:
>
> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice: LogActions:
> Move rbd_nfs-ha (Started node1 -> node2)
> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice: LogActions:
> Move rbd_home (Started node1 -> node2)
> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice: LogActions:
> Move fs_nfs-ha (Started node1 -> node2)
> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice: LogActions:
> Move FS_home (Started node1 -> node2)
> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice: LogActions:
> Move nfsserver (Started node1 -> node2)
> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice: LogActions:
> Move vip_nfs_private (Started node1 -> node2)
> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: info: LogActions:
> Leave NFS1 (Started node2)
> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: info: LogActions:
> Leave NFS2 (Started node3)
> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice: LogActions:
> Move NFS3 (Started node1 -> node2)
> Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: warning:
> process_pe_message: Calculated Transition 8:
> /var/lib/pacemaker/pengine/pe-warn-0.bz2
> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: info:
> do_state_transition: State transition S_POLICY_ENGINE ->
> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> origin=handle_response ]
> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: info:
> do_te_invoke: Processing graph 8 (ref=pe_calc-dc-1433198297-78) derived
> from /var/lib/pacemaker/pengine/pe-warn-0.bz2
> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice:
> te_fence_node: Executing reboot fencing operation (37) on node1
> (timeout=60000)
> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: notice:
> handle_request: Client crmd.2131.f7e79b61 wants to fence (reboot) 'node1'
> with device '(any)'
> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: notice:
> initiate_remote_stonith_op: Initiating remote operation reboot for
> node1: a22a16f3-b699-453e-a090-43a640dd0e3f (0)
> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: notice:
> can_fence_host_with_device: NFS1 can not fence (reboot) node1:
> static-list
> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: notice:
> can_fence_host_with_device: NFS2 can not fence (reboot) node1:
> static-list
> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: notice:
> can_fence_host_with_device: NFS3 can not fence (reboot) node1:
> static-list
> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: info:
> process_remote_stonith_query: All queries have arrived, continuing (2,
> 2, 2, a22a16f3-b699-453e-a090-43a640dd0e3f)
> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: notice:
> stonith_choose_peer: Couldn't find anyone to fence node1 with <any>
> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: info:
> call_remote_stonith: Total remote op timeout set to 60 for fencing of
> node node1 for crmd.2131.a22a16f3
> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: info:
> call_remote_stonith: None of the 2 peers have devices capable of
> terminating node1 for crmd.2131 (0)
> Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: error:
> remote_op_done: Operation reboot of node1 by <no-one> for
> crmd.2131@node3.a22a16f3: No such device
> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice:
> tengine_stonith_callback: Stonith operation
> 2/37:8:0:241ee032-f3a1-4c2b-8427-63af83b54343: No such device (-19)
> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice:
> tengine_stonith_callback: Stonith operation 2 for node1 failed (No
> such device): aborting transition.
> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice:
> abort_transition_graph: Transition aborted: Stonith failed
> (source=tengine_stonith_callback:697, 0)
> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice:
> tengine_stonith_notify: Peer node1 was not terminated (reboot) by <anyone>
> for node3: No such device (ref=a22a16f3-b699-453e-a090-43a640dd0e3f) by
> client crmd.2131
> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice: run_graph:
> Transition 8 (Complete=1, Pending=0, Fired=0, Skipped=27, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-warn-0.bz2): Stopped
> Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice:
> too_many_st_failures: No devices found in cluster to fence node1, giving
> up
>
> I can manually fence a guest without any issue:
> # fence_xvm -a 225.0.0.12 -k /etc/cluster/fence_xvm_ceph1.key -o reboot -H
> NFS1
>
> But the cluster doesn't recover resources to another host:

The cluster doesn't know that the manual fencing succeeded, so it plays
it safe by not moving resources. If you fix the cluster fencing issue,
I'd expect this to work.
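As an aside, once a node really has been powered off out-of-band, the cluster
can be told so manually; roughly (assuming the node name node1, and only if
you are certain the node is actually down):

# pcs stonith confirm node1

That acknowledges the fencing so resources can be recovered elsewhere
(stonith_admin --confirm is the lower-level equivalent); it does not fix the
device configuration itself.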

> # pcs status *<-- after manual fencing*
> Cluster name: nfs
> Last updated: Tue Jun 2 08:34:18 2015
> Last change: Mon Jun 1 16:02:58 2015
> Stack: corosync
> Current DC: node3 (3) - partition with quorum
> Version: 1.1.12-a14efad
> 3 Nodes configured
> 9 Resources configured
>
>
> Node node1 (1): UNCLEAN (offline)
> Online: [ node2 node3 ]
>
> Full list of resources:
>
> Resource Group: group_rbd_fs_nfs_vip
> rbd_nfs-ha (ocf::ceph:rbd.in): Started node1
> rbd_home (ocf::ceph:rbd.in): Started node1
> fs_nfs-ha (ocf::heartbeat:Filesystem): Started node1
> FS_home (ocf::heartbeat:Filesystem): Started node1
> nfsserver (ocf::heartbeat:nfsserver): Started node1
> vip_nfs_private (ocf::heartbeat:IPaddr): Started node1
> NFS1 (stonith:fence_xvm): Started node2
> NFS2 (stonith:fence_xvm): Started node3
> NFS3 (stonith:fence_xvm): Started node1
>
> PCSD Status:
> node1: Online
> node2: Online
> node3: Online
>
> Daemon Status:
> corosync: active/disabled
> pacemaker: active/disabled
> pcsd: active/enabled
>
> Fence_virtd config on one of the hypervisors:
> # cat fence_virt.conf
> backends {
> libvirt {
> uri = "qemu:///system";
> }
>
> }
>
> listeners {
> multicast {
> port = "1229";
> family = "ipv4";
> interface = "br1";
> address = "225.0.0.12";
> key_file = "/etc/cluster/fence_xvm_ceph1.key";
> }
>
> }
>
> fence_virtd {
> module_path = "/usr/lib64/fence-virt";
> backend = "libvirt";
> listener = "multicast";
> }

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org