[ClusterLabs] Antw: Re: Three VM's in cluster, running on multiple libvirt hosts, stonith not working
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Wed Jun 3 06:20:07 UTC 2015
>>> Steve Dainard <sdainard at spd1.com> wrote on 02.06.2015 at 21:40 in message
<CAEMJtDs3vq4UZtb1DJioGP3w-JaedqWW5vHPhMvf3Tj7mHB9ew at mail.gmail.com>:
> Hi Ken,
>
> I've tried configuring without pcmk_host_list as well with the same result.
I can't help here, sorry. But is there a mechanism to manually trigger fencing of a specific node through the cluster? That would help with testing, I guess.
What would be the command line to run a "fencing RA"?
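Perhaps something along these lines, assuming recent stonith_admin/pcs tooling (just a sketch, untested here)?

# stonith_admin --reboot node1
# pcs stonith fence node1

Or one could run the fence agent itself directly, passing its options on the command line (as fence_xvm accepts).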
Regards,
Ulrich
>
> Stonith Devices:
> Resource: NFS1 (class=stonith type=fence_xvm)
> Attributes: key_file=/etc/cluster/fence_xvm_ceph1.key
> multicast_address=225.0.0.12 port=NFS1
> Operations: monitor interval=20s (NFS1-monitor-interval-20s)
> Resource: NFS2 (class=stonith type=fence_xvm)
> Attributes: key_file=/etc/cluster/fence_xvm_ceph2.key
> multicast_address=225.0.1.12 port=NFS2
> Operations: monitor interval=20s (NFS2-monitor-interval-20s)
> Resource: NFS3 (class=stonith type=fence_xvm)
> Attributes: key_file=/etc/cluster/fence_xvm_ceph3.key
> multicast_address=225.0.2.12 port=NFS3
> Operations: monitor interval=20s (NFS3-monitor-interval-20s)
>
> I can get the list of VMs from any of the 3 cluster nodes using the
> multicast address:
>
> # fence_xvm -a 225.0.0.12 -k /etc/cluster/fence_xvm_ceph1.key -o list
> NFS1 1814d93d-3e40-797f-a3c6-102aaa6a3d01 on
>
> # fence_xvm -a 225.0.1.12 -k /etc/cluster/fence_xvm_ceph2.key -o list
> NFS2 75ab85fc-40e9-45ae-8b0a-c346d59b24e8 on
>
> # fence_xvm -a 225.0.2.12 -k /etc/cluster/fence_xvm_ceph3.key -o list
> NFS3 f23cca5d-d50b-46d2-85dd-d8357337fd22 on
>
> On Tue, Jun 2, 2015 at 10:07 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
>
>> On 06/02/2015 11:40 AM, Steve Dainard wrote:
>> > Hello,
>> >
>> > I have 3 CentOS7 guests running on 3 CentOS7 hypervisors and I can't get
>> > stonith operations to work.
>> >
>> > Config:
>> >
>> > Cluster Name: nfs
>> > Corosync Nodes:
>> > node1 node2 node3
>> > Pacemaker Nodes:
>> > node1 node2 node3
>> >
>> > Resources:
>> > Group: group_rbd_fs_nfs_vip
>> > Resource: rbd_nfs-ha (class=ocf provider=ceph type=rbd.in)
>> > Attributes: user=admin pool=rbd name=nfs-ha
>> cephconf=/etc/ceph/ceph.conf
>> > Operations: start interval=0s timeout=20 (rbd_nfs-ha-start-timeout-20)
>> > stop interval=0s timeout=20 (rbd_nfs-ha-stop-timeout-20)
>> > monitor interval=10s timeout=20s
>> > (rbd_nfs-ha-monitor-interval-10s)
>> > Resource: rbd_home (class=ocf provider=ceph type=rbd.in)
>> > Attributes: user=admin pool=rbd name=home cephconf=/etc/ceph/ceph.conf
>> > Operations: start interval=0s timeout=20 (rbd_home-start-timeout-20)
>> > stop interval=0s timeout=20 (rbd_home-stop-timeout-20)
>> > monitor interval=10s timeout=20s
>> > (rbd_home-monitor-interval-10s)
>> > Resource: fs_nfs-ha (class=ocf provider=heartbeat type=Filesystem)
>> > Attributes: directory=/mnt/nfs-ha fstype=btrfs
>> > device=/dev/rbd/rbd/nfs-ha fast_stop=no
>> > Operations: monitor interval=20s timeout=40s
>> > (fs_nfs-ha-monitor-interval-20s)
>> > start interval=0 timeout=60s (fs_nfs-ha-start-interval-0)
>> > stop interval=0 timeout=60s (fs_nfs-ha-stop-interval-0)
>> > Resource: FS_home (class=ocf provider=heartbeat type=Filesystem)
>> > Attributes: directory=/mnt/home fstype=btrfs device=/dev/rbd/rbd/home
>> > options=rw,compress-force=lzo fast_stop=no
>> > Operations: monitor interval=20s timeout=40s
>> > (FS_home-monitor-interval-20s)
>> > start interval=0 timeout=60s (FS_home-start-interval-0)
>> > stop interval=0 timeout=60s (FS_home-stop-interval-0)
>> > Resource: nfsserver (class=ocf provider=heartbeat type=nfsserver)
>> > Attributes: nfs_shared_infodir=/mnt/nfs-ha
>> > Operations: stop interval=0s timeout=20s (nfsserver-stop-timeout-20s)
>> > monitor interval=10s timeout=20s
>> > (nfsserver-monitor-interval-10s)
>> > start interval=0 timeout=40s (nfsserver-start-interval-0)
>> > Resource: vip_nfs_private (class=ocf provider=heartbeat type=IPaddr)
>> > Attributes: ip=10.0.231.49 cidr_netmask=24
>> > Operations: start interval=0s timeout=20s
>> > (vip_nfs_private-start-timeout-20s)
>> > stop interval=0s timeout=20s
>> > (vip_nfs_private-stop-timeout-20s)
>> > monitor interval=5 (vip_nfs_private-monitor-interval-5)
>> >
>> > Stonith Devices:
>> > Resource: NFS1 (class=stonith type=fence_xvm)
>> > Attributes: pcmk_host_list=10.0.231.50
>> > key_file=/etc/cluster/fence_xvm_ceph1.key multicast_address=225.0.0.12
>> > port=NFS1
>> > Operations: monitor interval=20s (NFS1-monitor-interval-20s)
>> > Resource: NFS2 (class=stonith type=fence_xvm)
>> > Attributes: pcmk_host_list=10.0.231.51
>> > key_file=/etc/cluster/fence_xvm_ceph2.key multicast_address=225.0.1.12
>> > port=NFS2
>> > Operations: monitor interval=20s (NFS2-monitor-interval-20s)
>> > Resource: NFS3 (class=stonith type=fence_xvm)
>> > Attributes: pcmk_host_list=10.0.231.52
>> > key_file=/etc/cluster/fence_xvm_ceph3.key multicast_address=225.0.2.12
>> > port=NFS3
>>
>> I think pcmk_host_list should have the node name rather than the IP
>> address. If fence_xvm -o list -a whatever shows the right nodes to
>> fence, you don't even need to set pcmk_host_list.
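>>
>> For example, something like this might do it (a sketch only, assuming the cluster
>> node name for that guest is node1 and that your pcs version accepts this syntax):
>>
>>   # pcs stonith update NFS1 pcmk_host_list=node1
>>
>> Or, since the libvirt domain names (NFS1/NFS2/NFS3) differ from the node names,
>> pcmk_host_map="node1:NFS1" can be used to map one to the other.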
>>
>> > Operations: monitor interval=20s (NFS3-monitor-interval-20s)
>> > Fencing Levels:
>> >
>> > Location Constraints:
>> > Resource: NFS1
>> > Enabled on: node1 (score:1) (id:location-NFS1-node1-1)
>> > Enabled on: node2 (score:1000) (id:location-NFS1-node2-1000)
>> > Enabled on: node3 (score:500) (id:location-NFS1-node3-500)
>> > Resource: NFS2
>> > Enabled on: node2 (score:1) (id:location-NFS2-node2-1)
>> > Enabled on: node3 (score:1000) (id:location-NFS2-node3-1000)
>> > Enabled on: node1 (score:500) (id:location-NFS2-node1-500)
>> > Resource: NFS3
>> > Enabled on: node3 (score:1) (id:location-NFS3-node3-1)
>> > Enabled on: node1 (score:1000) (id:location-NFS3-node1-1000)
>> > Enabled on: node2 (score:500) (id:location-NFS3-node2-500)
>> > Ordering Constraints:
>> > Colocation Constraints:
>> >
>> > Cluster Properties:
>> > cluster-infrastructure: corosync
>> > cluster-name: nfs
>> > dc-version: 1.1.12-a14efad
>> > have-watchdog: false
>> > stonith-enabled: true
>> >
>> > When I stop networking services on node1 (stonith resource NFS1) I see
>> logs
>> > on the other two cluster nodes attempting to reboot the VM NFS1 without
>> > success.
>> >
>> > Logs:
>> >
>> > Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice:
>> LogActions:
>> > Move rbd_nfs-ha (Started node1 -> node2)
>> > Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice:
>> LogActions:
>> > Move rbd_home (Started node1 -> node2)
>> > Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice:
>> LogActions:
>> > Move fs_nfs-ha (Started node1 -> node2)
>> > Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice:
>> LogActions:
>> > Move FS_home (Started node1 -> node2)
>> > Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice:
>> LogActions:
>> > Move nfsserver (Started node1 -> node2)
>> > Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice:
>> LogActions:
>> > Move vip_nfs_private (Started node1 -> node2)
>> > Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: info:
>> LogActions:
>> > Leave NFS1 (Started node2)
>> > Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: info:
>> LogActions:
>> > Leave NFS2 (Started node3)
>> > Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: notice:
>> LogActions:
>> > Move NFS3 (Started node1 -> node2)
>> > Jun 01 15:38:17 [2130] nfs3.pcic.uvic.ca pengine: warning:
>> > process_pe_message: Calculated Transition 8:
>> > /var/lib/pacemaker/pengine/pe-warn-0.bz2
>> > Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: info:
>> > do_state_transition: State transition S_POLICY_ENGINE ->
>> > S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
>> > origin=handle_response ]
>> > Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: info:
>> > do_te_invoke: Processing graph 8 (ref=pe_calc-dc-1433198297-78)
>> derived
>> > from /var/lib/pacemaker/pengine/pe-warn-0.bz2
>> > Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice:
>> > te_fence_node: Executing reboot fencing operation (37) on node1
>> > (timeout=60000)
>> > Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: notice:
>> > handle_request: Client crmd.2131.f7e79b61 wants to fence (reboot)
>> 'node1'
>> > with device '(any)'
>> > Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: notice:
>> > initiate_remote_stonith_op: Initiating remote operation reboot for
>> > node1: a22a16f3-b699-453e-a090-43a640dd0e3f (0)
>> > Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: notice:
>> > can_fence_host_with_device: NFS1 can not fence (reboot) node1:
>> > static-list
>> > Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: notice:
>> > can_fence_host_with_device: NFS2 can not fence (reboot) node1:
>> > static-list
>> > Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: notice:
>> > can_fence_host_with_device: NFS3 can not fence (reboot) node1:
>> > static-list
>> > Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: info:
>> > process_remote_stonith_query: All queries have arrived, continuing (2,
>> > 2, 2, a22a16f3-b699-453e-a090-43a640dd0e3f)
>> > Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: notice:
>> > stonith_choose_peer: Couldn't find anyone to fence node1 with <any>
>> > Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: info:
>> > call_remote_stonith: Total remote op timeout set to 60 for fencing of
>> > node node1 for crmd.2131.a22a16f3
>> > Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: info:
>> > call_remote_stonith: None of the 2 peers have devices capable of
>> > terminating node1 for crmd.2131 (0)
>> > Jun 01 15:38:17 [2127] nfs3.pcic.uvic.ca stonith-ng: error:
>> > remote_op_done: Operation reboot of node1 by <no-one> for
>> > crmd.2131 at node3.a22a16f3: No such device
>> > Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice:
>> > tengine_stonith_callback: Stonith operation
>> > 2/37:8:0:241ee032-f3a1-4c2b-8427-63af83b54343: No such device (-19)
>> > Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice:
>> > tengine_stonith_callback: Stonith operation 2 for node1 failed (No
>> > such device): aborting transition.
>> > Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice:
>> > abort_transition_graph: Transition aborted: Stonith failed
>> > (source=tengine_stonith_callback:697, 0)
>> > Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice:
>> > tengine_stonith_notify: Peer node1 was not terminated (reboot) by
>> <anyone>
>> > for node3: No such device (ref=a22a16f3-b699-453e-a090-43a640dd0e3f) by
>> > client crmd.2131
>> > Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice:
>> run_graph:
>> > Transition 8 (Complete=1, Pending=0, Fired=0, Skipped=27,
>> Incomplete=0,
>> > Source=/var/lib/pacemaker/pengine/pe-warn-0.bz2): Stopped
>> > Jun 01 15:38:17 [2131] nfs3.pcic.uvic.ca crmd: notice:
>> > too_many_st_failures: No devices found in cluster to fence node1,
>> giving
>> > up
>> >
>> > I can manually fence a guest without any issue:
>> > # fence_xvm -a 225.0.0.12 -k /etc/cluster/fence_xvm_ceph1.key -o reboot -H NFS1
>> >
>> > But the cluster doesn't recover resources to another host:
>>
>> The cluster doesn't know that the manual fencing succeeded, so it plays
>> it safe by not moving resources. If you fix the cluster fencing issue,
>> I'd expect this to work.
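>>
>> As an aside: if you are certain a node is really down after fencing it by hand,
>> you can usually acknowledge that to the cluster (assuming recent stonith_admin/pcs):
>>
>>   # stonith_admin --confirm node1
>>
>> or "pcs stonith confirm node1". Use that with care, though: it tells Pacemaker the
>> node is safely off, and it does not fix the underlying device configuration.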
>>
>> > # pcs status    <-- after manual fencing
>> > Cluster name: nfs
>> > Last updated: Tue Jun 2 08:34:18 2015
>> > Last change: Mon Jun 1 16:02:58 2015
>> > Stack: corosync
>> > Current DC: node3 (3) - partition with quorum
>> > Version: 1.1.12-a14efad
>> > 3 Nodes configured
>> > 9 Resources configured
>> >
>> >
>> > Node node1 (1): UNCLEAN (offline)
>> > Online: [ node2 node3 ]
>> >
>> > Full list of resources:
>> >
>> > Resource Group: group_rbd_fs_nfs_vip
>> > rbd_nfs-ha (ocf::ceph:rbd.in): Started node1
>> > rbd_home (ocf::ceph:rbd.in): Started node1
>> > fs_nfs-ha (ocf::heartbeat:Filesystem): Started node1
>> > FS_home (ocf::heartbeat:Filesystem): Started node1
>> > nfsserver (ocf::heartbeat:nfsserver): Started node1
>> > vip_nfs_private (ocf::heartbeat:IPaddr): Started node1
>> > NFS1 (stonith:fence_xvm): Started node2
>> > NFS2 (stonith:fence_xvm): Started node3
>> > NFS3 (stonith:fence_xvm): Started node1
>> >
>> > PCSD Status:
>> > node1: Online
>> > node2: Online
>> > node3: Online
>> >
>> > Daemon Status:
>> > corosync: active/disabled
>> > pacemaker: active/disabled
>> > pcsd: active/enabled
>> >
>> > Fence_virtd config on one of the hypervisors:
>> > # cat fence_virt.conf
>> > backends {
>> > libvirt {
>> > uri = "qemu:///system";
>> > }
>> >
>> > }
>> >
>> > listeners {
>> > multicast {
>> > port = "1229";
>> > family = "ipv4";
>> > interface = "br1";
>> > address = "225.0.0.12";
>> > key_file = "/etc/cluster/fence_xvm_ceph1.key";
>> > }
>> >
>> > }
>> >
>> > fence_virtd {
>> > module_path = "/usr/lib64/fence-virt";
>> > backend = "libvirt";
>> > listener = "multicast";
>> > }
>>
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> http://clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>