[ClusterLabs] Error at testing live migration

Andrei Borzenkov arvidjaar at gmail.com
Sat Mar 28 00:36:58 EDT 2015


On Fri, 27 Mar 2015 16:40:18 -0500
Wilson Acero <rasalax at hotmail.com> writes:

> Hi Ken, thanks for your answer. Before the live migration tests I ran tests to see how Pacemaker manages the virtual machine shutdown. With the command "pcs cluster standby nodoX" there were no errors, but when rebooting or shutting down the node, the virtual machine, gfs2wa and iscsi resources failed and the node became UNCLEAN. After a lot of tests I modified my /usr/lib/systemd/system/corosync.service file and added these entries:
> 
> After=iscsid.service
> After=remote-fs.target
> After=libvirtd.service
> 

You should never ever edit files under /usr/lib - they will be
overwritten on update. Assuming CentOS 7 is using a systemd modern
enough to support drop-in directories, create the directory

/etc/systemd/system/corosync.service.d

and any file with the suffix .conf in this directory will be merged
into the service definition. Just do

mkdir -p /etc/systemd/system/corosync.service.d
cat > /etc/systemd/system/corosync.service.d/local.conf <<EOF
[Unit]
After=iscsid.service remote-fs.target libvirtd.service
EOF
systemctl daemon-reload

(drop-ins are parsed like unit files, so the [Unit] header is needed,
and the final daemon-reload makes systemd pick up the change
immediately rather than at the next boot).

If CentOS 7 does not support drop-ins yet, just copy the full
definition into /etc/systemd/system and edit it there.
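
A minimal sketch of that fallback, assuming the stock CentOS paths (a
unit in /etc/systemd/system overrides the one in /usr/lib/systemd/system,
so the edited copy survives package updates, at the price of not picking
up future changes to the packaged unit):

cp /usr/lib/systemd/system/corosync.service /etc/systemd/system/corosync.service
# edit /etc/systemd/system/corosync.service to add the After= entries, then:
systemctl daemon-reload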

But - why corosync? It has no business starting or stopping resources,
right? It is pacemaker that needs those dependencies, not corosync. I
guess it worked simply because pacemaker is itself ordered After
corosync and so is stopped before it.
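
A quick way to check that guess - systemctl can print the effective
ordering dependencies of a unit (exact output varies by version):

systemctl show -p After pacemaker.service

If corosync.service appears in the After= list, pacemaker is indeed
ordered after (and therefore stopped before) corosync as described.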

> It solved the shutdown/reboot error, giving Pacemaker enough time to shut down the virtual machine, restart it on another node, and continue with the reboot of the node, but when testing the live migration, it fails. 
> 
> I added your modification to /usr/lib/systemd/system/pacemaker.service, but it did not work. 
> 
> Searching for this error I found out that systemd now includes "systemd-machined.service", a service to monitor, start or shut down virtual machines via the machinectl command. I tried to disable it, but libvirt needs it to run a virtual machine.

That is something to discuss on the systemd list. I myself have zero
experience with this tool. Do I understand correctly that to start a
VM (even from within pacemaker) this service must be present and
running?

Then you should order pacemaker after this service; extend
pacemaker.service with

After=systemd-machined.service

(see above for the correct way to do it). This should ensure that pacemaker
will be stopped before systemd-machined and have time to perform
whatever cleanup it needs.
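
Concretely, a sketch of such a drop-in, using the same mechanism as
above (the full dependency list here is an assumption based on your
setup - adjust as needed):

mkdir -p /etc/systemd/system/pacemaker.service.d
cat > /etc/systemd/system/pacemaker.service.d/local.conf <<EOF
[Unit]
After=systemd-machined.service iscsid.service remote-fs.target libvirtd.service
EOF
systemctl daemon-reload

At shutdown systemd stops units in the reverse of their After= order,
so pacemaker - and the VM it manages - gets torn down while all of the
listed services are still running.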

> 
> [root at nodo3 system]# machinectl
> MACHINE                          CONTAINER SERVICE
> qemu-centos2                     vm        libvirt-qemu
> 
> 1 machines listed.
> [root at nodo3 system]#
> [root at nodo3 system]# systemctl status systemd-machined.service
> systemd-machined.service - Virtual Machine and Container Registration Service
>    Loaded: loaded (/usr/lib/systemd/system/systemd-machined.service; static)
>    Active: active (running) since Fri 2015-03-27 16:13:20 ECT; 22min ago
>      Docs: man:systemd-machined.service(8)
>            http://www.freedesktop.org/wiki/Software/systemd/machined
>  Main PID: 2982 (systemd-machine)
>    CGroup: /system.slice/systemd-machined.service
>            └─2982 /usr/lib/systemd/systemd-machined
> 
> Mar 27 16:13:20 nodo3.redwa.local systemd[1]: Starting Virtual Machine and Container Registration Service...
> Mar 27 16:13:20 nodo3.redwa.local systemd[1]: Started Virtual Machine and Container Registration Service.
> Mar 27 16:13:20 nodo3.redwa.local systemd-machined[2982]: New machine qemu-centos2.
> 
> I guess that service is the culprit, but I don't know how to deal with it. 
> 
> Thanks a lot. 
> 
> From: rasalax at hotmail.com
> To: users at clusterlabs.org
> Subject: Error at testing live migration
> Date: Fri, 27 Mar 2015 12:46:47 -0500
> 
> Hi everybody, 
> I have a pacemaker + corosync cluster that manages a virtual machine (KVM). The virtual machine's disks are stored on shared storage (GFS2 + LVM + iSCSI LUN). The resource agent is VirtualDomain. 
> When I test the live migration with the command 'pcs resource move vmcentos2 nodo2' or by putting the node on standby, the migration works with no problem. 
> But when I test the live migration by rebooting or shutting down the node that runs the virtual machine, the migration fails. Is this expected behaviour or a bug?
> My cluster configuration is:
> OS = CentOS 7
> Pacemaker 1.1.10-32.el7_0.1
> Corosync Cluster Engine, version '2.3.3'
> 
> [root at nodo2 ~]# pcs status
> Cluster name: clusterwa
> Last updated: Fri Mar 27 12:20:04 2015
> Last change: Thu Mar 26 16:11:11 2015 via crm_resource on nodo2
> Stack: corosync
> Current DC: nodo2 (2) - partition with quorum
> Version: 1.1.10-32.el7_0.1-368c7265
> Nodes configured
> 29 Resources configured
> 
> Online: [ nodo2 nodo3 nodo4 ]
> Containers: [ centos1.7:vmcentos3 ]
> 
> Full list of resources:
> 
>  wti_wa (stonith:fence_wti):    Started nodo3
>  Clone Set: dlmwa-clone [dlmwa]
>      Started: [ nodo2 nodo3 nodo4 ]
>      Stopped: [ centos1.7 centosSC3 ]
>  Clone Set: clvmwa-clone [clvmwa]
>      Started: [ nodo2 nodo3 nodo4 ]
>      Stopped: [ centos1.7 centosSC3 ]
>  Clone Set: gfs2wa-clone [gfs2wa]
>      Started: [ nodo2 nodo3 nodo4 ]
>      Stopped: [ centos1.7 centosSC3 ]
>  vmcentos2      (ocf::heartbeat:VirtualDomain): Started nodo2
>  Clone Set: iscsiwa-clone [iscsiwa]
>      Started: [ nodo2 nodo3 nodo4 ]
>      Stopped: [ centos1.7 centosSC3 ]
> 
> PCSD Status:
>   nodo2: Online
>   nodo3: Online
>   nodo4: Online
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> Many thanks.



