Hi,

I am running a few corosync "passive mode" Redundant Ring Protocol (RRP) failure scenarios. My cluster has several remote-node VirtualDomain resources running on each node in the cluster, which have been configured to allow Live Guest Migration (LGM) operations.

While both corosync rings are active, if I drop ring0 on a node where I have remote nodes (guests) running, the guest is shut down and restarted on the same host, after which the connection is re-established and the guest continues to run on that same cluster node.

I am wondering why pacemaker doesn't try to "live" migrate the remote node (guest) to a different node instead of rebooting the guest. Is there some way to configure the remote nodes so that the recovery action is LGM instead of reboot when the host-to-remote-node connection is lost in an RRP situation? The next question is: is it even possible to LGM a remote node guest if the corosync ring fails over from ring0 to ring1 (or vice versa)?

# For example, here's a remote node's VirtualDomain resource definition.

[root@zs95kj]# pcs resource show zs95kjg110102_res
 Resource: zs95kjg110102_res (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/guestxml/nfs1/zs95kjg110102.xml hypervisor=qemu:///system migration_transport=ssh
  Meta Attrs: allow-migrate=true remote-node=zs95kjg110102 remote-addr=10.20.110.102
  Operations: start interval=0s timeout=480 (zs95kjg110102_res-start-interval-0s)
              stop interval=0s timeout=120 (zs95kjg110102_res-stop-interval-0s)
              monitor interval=30s (zs95kjg110102_res-monitor-interval-30s)
              migrate-from interval=0s timeout=1200 (zs95kjg110102_res-migrate-from-interval-0s)
              migrate-to interval=0s timeout=1200 (zs95kjg110102_res-migrate-to-interval-0s)
[root@zs95kj VD]#
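For reference, a definition like the one above can be produced with a single pcs command roughly along these lines (the names, paths and timeouts are the ones from my cluster; the exact pcs syntax may vary a bit between versions):

[root@zs95kj]# pcs resource create zs95kjg110102_res ocf:heartbeat:VirtualDomain \
       config=/guestxml/nfs1/zs95kjg110102.xml hypervisor="qemu:///system" migration_transport=ssh \
       meta allow-migrate=true remote-node=zs95kjg110102 remote-addr=10.20.110.102 \
       op start timeout=480 op stop timeout=120 op monitor interval=30s \
       op migrate-to timeout=1200 op migrate-from timeout=1200

The remote-node meta attribute is what turns the guest into a pacemaker remote node, and allow-migrate=true is what enables the migrate-to / migrate-from operations on the resource.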
# My RRP rings are active, and configured with rrp_mode=passive.

[root@zs95kj ~]# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
 id = 10.20.93.12
 status = ring 0 active with no faults
RING ID 1
 id = 10.20.94.212
 status = ring 1 active with no faults

# Here's the corosync.conf ..

[root@zs95kj ~]# cat /etc/corosync/corosync.conf
totem {
    version: 2
    secauth: off
    cluster_name: test_cluster_2
    transport: udpu
    rrp_mode: passive
}

nodelist {
    node {
        ring0_addr: zs95kjpcs1
        ring1_addr: zs95kjpcs2
        nodeid: 2
    }

    node {
        ring0_addr: zs95KLpcs1
        ring1_addr: zs95KLpcs2
        nodeid: 3
    }

    node {
        ring0_addr: zs90kppcs1
        ring1_addr: zs90kppcs2
        nodeid: 4
    }

    node {
        ring0_addr: zs93KLpcs1
        ring1_addr: zs93KLpcs2
        nodeid: 5
    }

    node {
        ring0_addr: zs93kjpcs1
        ring1_addr: zs93kjpcs2
        nodeid: 1
    }
}

quorum {
    provider: corosync_votequorum
}

logging {
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
    timestamp: on
    syslog_facility: daemon
    to_syslog: yes
    debug: on

    logger_subsys {
        debug: off
        subsys: QUORUM
    }
}

# Here's the vlan / route situation on cluster node zs95kj:

ring0 is on vlan1293
ring1 is on vlan1294

[root@zs95kj ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.20.93.254    0.0.0.0         UG    400    0        0 vlan1293        << default route to guests from ring0
9.0.0.0         9.12.23.1       255.0.0.0       UG    400    0        0 vlan508
9.12.23.0       0.0.0.0         255.255.255.0   U     400    0        0 vlan508
10.20.92.0      0.0.0.0         255.255.255.0   U     400    0        0 vlan1292
10.20.93.0      0.0.0.0         255.255.255.0   U     0      0        0 vlan1293        << ring0 IPs
10.20.93.0      0.0.0.0         255.255.255.0   U     400    0        0 vlan1293
10.20.94.0      0.0.0.0         255.255.255.0   U     0      0        0 vlan1294        << ring1 IPs
10.20.94.0      0.0.0.0         255.255.255.0   U     400    0        0 vlan1294
10.20.101.0     0.0.0.0         255.255.255.0   U     400    0        0 vlan1298
10.20.109.0     10.20.94.254    255.255.255.0   UG    400    0        0 vlan1294        << route to guests on 10.20.109 from ring1
10.20.110.0     10.20.94.254    255.255.255.0   UG    400    0        0 vlan1294        << route to guests on 10.20.110 from ring1
169.254.0.0     0.0.0.0         255.255.0.0     U     1007   0        0 enccw0.0.02e0
169.254.0.0     0.0.0.0         255.255.0.0     U     1016   0        0 ovsbridge1
192.168.122.0   0.0.0.0         255.255.255.0   U     0      0        0 virbr0

# On the remote node, you can see we have a connection back to the host.

Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted: info: crm_log_init: Changed active directory to /var/lib/heartbeat/cores/root
Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: lrmd
Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted: notice: lrmd_init_remote_tls_server: Starting a tls listener on port 3121.
Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted: notice: bind_and_listen: Listening on address ::
Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: cib_ro
Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: cib_rw
Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: cib_shm
Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: attrd
Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: stonith-ng
Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: crmd
Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted: info: main: Starting
Feb 28 14:30:27 [928] zs95kjg110102 pacemaker_remoted: notice: lrmd_remote_listen: LRMD client connection established. 0x9ec18b50 id: 93e25ef0-4ff8-45ac-a6ed-f13b64588326

zs95kjg110102:~ # netstat -anp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State        PID/Program name
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN       946/sshd
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN       1022/master
tcp        0      0 0.0.0.0:5666            0.0.0.0:*               LISTEN       931/xinetd
tcp        0      0 0.0.0.0:5801            0.0.0.0:*               LISTEN       931/xinetd
tcp        0      0 0.0.0.0:5901            0.0.0.0:*               LISTEN       931/xinetd
tcp        0      0 :::21                   :::*                    LISTEN       926/vsftpd
tcp        0      0 :::22                   :::*                    LISTEN       946/sshd
tcp        0      0 ::1:25                  :::*                    LISTEN       1022/master
tcp        0      0 :::44931                :::*                    LISTEN       1068/xdm
tcp        0      0 :::80                   :::*                    LISTEN       929/httpd-prefork
tcp        0      0 :::3121                 :::*                    LISTEN       928/pacemaker_remot
tcp        0      0 10.20.110.102:3121      10.20.93.12:46425       ESTABLISHED  928/pacemaker_remot
udp        0      0 :::177                  :::*                                 1068/xdm

## Drop the ring0 (vlan1293) interface on cluster node zs95kj, causing failover to ring1 (vlan1294).

[root@zs95kj]# date;ifdown vlan1293
Tue Feb 28 15:54:11 EST 2017
Device 'vlan1293' successfully disconnected.
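Side note: my understanding is that with rrp_mode: passive a ring that has been marked FAULTY is not re-enabled on its own, so once the interface is back I restore it with something like the following (corosync-cfgtool -r resets the redundant ring state cluster-wide):

[root@zs95kj]# ifup vlan1293            # bring the ring0 interface back up
[root@zs95kj]# corosync-cfgtool -r      # reset redundant ring state cluster-wide
[root@zs95kj]# corosync-cfgtool -s      # verify both rings report "no faults" again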
"FAULTY")<br><br>[root@zs95kj]# date;corosync-cfgtool -s<br>Tue Feb 28 15:54:49 EST 2017<br>Printing ring status.<br>Local node ID 2<br>RING ID 0<br> id = 10.20.93.12<br> status = Marking ringid 0 interface 10.20.93.12 FAULTY<br>RING ID 1<br> id = 10.20.94.212<br> status = ring 1 active with no faults<br>[root@zs95kj VD]#<br><br><br><br><br># See that the resource stayed local to cluster node zs95kj. <br><br>[root@zs95kj]# date;pcs resource show |grep zs95kjg110102<br>Tue Feb 28 15:55:32 EST 2017<br> zs95kjg110102_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1<br>You have new mail in /var/spool/mail/root<br><br><br><br># On the remote node, show new entries in pacemaker.log showing connection re-established. <br><br>Feb 28 15:55:17 [928] zs95kjg110102 pacemaker_remoted: notice: crm_signal_dispatch: Invoking handler for signal 15: Terminated<br>Feb 28 15:55:17 [928] zs95kjg110102 pacemaker_remoted: info: lrmd_shutdown: Terminating with 1 clients<br>Feb 28 15:55:17 [928] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_withdraw: withdrawing server sockets<br>Feb 28 15:55:17 [928] zs95kjg110102 pacemaker_remoted: info: crm_xml_cleanup: Cleaning up memory from libxml2<br>Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: crm_log_init: Changed active directory to /var/lib/heartbeat/cores/root<br>Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: lrmd<br><font color="#0000FF">Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: notice: lrmd_init_remote_tls_server: Starting a tls listener on port 3121.</font><br>Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: notice: bind_and_listen: Listening on address ::<br>Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: cib_ro<br>Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: cib_rw<br>Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: cib_shm<br>Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: attrd<br>Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: stonith-ng<br>Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: crmd<br>Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: main: Starting<br><font color="#0000FF">Feb 28 15:55:38 [942] zs95kjg110102 pacemaker_remoted: notice: lrmd_remote_listen: LRMD client connection established. 
# On the remote node, new entries in pacemaker.log show the connection being re-established.

Feb 28 15:55:17 [928] zs95kjg110102 pacemaker_remoted: notice: crm_signal_dispatch: Invoking handler for signal 15: Terminated
Feb 28 15:55:17 [928] zs95kjg110102 pacemaker_remoted: info: lrmd_shutdown: Terminating with 1 clients
Feb 28 15:55:17 [928] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_withdraw: withdrawing server sockets
Feb 28 15:55:17 [928] zs95kjg110102 pacemaker_remoted: info: crm_xml_cleanup: Cleaning up memory from libxml2
Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: crm_log_init: Changed active directory to /var/lib/heartbeat/cores/root
Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: lrmd
Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: notice: lrmd_init_remote_tls_server: Starting a tls listener on port 3121.
Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: notice: bind_and_listen: Listening on address ::
Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: cib_ro
Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: cib_rw
Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: cib_shm
Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: attrd
Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: stonith-ng
Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: crmd
Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: main: Starting
Feb 28 15:55:38 [942] zs95kjg110102 pacemaker_remoted: notice: lrmd_remote_listen: LRMD client connection established. 0xbed1ab50 id: b19ed532-6f61-4d9c-9439-ffb836eea34f

zs95kjg110102:~ # netstat -anp |less
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State        PID/Program name
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN       961/sshd
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN       1065/master
tcp        0      0 0.0.0.0:5666            0.0.0.0:*               LISTEN       946/xinetd
tcp        0      0 0.0.0.0:5801            0.0.0.0:*               LISTEN       946/xinetd
tcp        0      0 0.0.0.0:5901            0.0.0.0:*               LISTEN       946/xinetd
tcp        0      0 10.20.110.102:22        10.20.94.32:57749       ESTABLISHED  1134/0
tcp        0      0 :::21                   :::*                    LISTEN       941/vsftpd
tcp        0      0 :::22                   :::*                    LISTEN       961/sshd
tcp        0      0 ::1:25                  :::*                    LISTEN       1065/master
tcp        0      0 :::80                   :::*                    LISTEN       944/httpd-prefork
tcp        0      0 :::3121                 :::*                    LISTEN       942/pacemaker_remot
tcp        0      0 :::34836                :::*                    LISTEN       1070/xdm
tcp        0      0 10.20.110.102:3121      10.20.94.212:49666      ESTABLISHED  942/pacemaker_remot
udp        0      0 :::177                  :::*                                 1070/xdm

## On host node zs95kj, system messages show the remote node (guest) being shut down and started (but no attempt to LGM).

[root@zs95kj ~]# grep "Feb 28" /var/log/messages |grep zs95kjg110102

Feb 28 15:55:07 zs95kj crmd[121380]: error: Operation zs95kjg110102_monitor_30000: Timed Out (node=zs95kjpcs1, call=2, timeout=30000ms)
Feb 28 15:55:07 zs95kj crmd[121380]: error: Unexpected disconnect on remote-node zs95kjg110102
Feb 28 15:55:17 zs95kj crmd[121380]: notice: Operation zs95kjg110102_stop_0: ok (node=zs95kjpcs1, call=38, rc=0, cib-update=370, confirmed=true)
Feb 28 15:55:17 zs95kj attrd[121378]: notice: Removing all zs95kjg110102 attributes for zs95kjpcs1
Feb 28 15:55:17 zs95kj VirtualDomain(zs95kjg110102_res)[173127]: INFO: Issuing graceful shutdown request for domain zs95kjg110102.
Feb 28 15:55:23 zs95kj systemd-machined: Machine qemu-38-zs95kjg110102 terminated.
Feb 28 15:55:23 zs95kj crmd[121380]: notice: Operation zs95kjg110102_res_stop_0: ok (node=zs95kjpcs1, call=858, rc=0, cib-update=378, confirmed=true)
Feb 28 15:55:24 zs95kj systemd-machined: New machine qemu-64-zs95kjg110102.
Feb 28 15:55:24 zs95kj systemd: Started Virtual Machine qemu-64-zs95kjg110102.
Feb 28 15:55:24 zs95kj systemd: Starting Virtual Machine qemu-64-zs95kjg110102.
Feb 28 15:55:25 zs95kj crmd[121380]: notice: Operation zs95kjg110102_res_start_0: ok (node=zs95kjpcs1, call=859, rc=0, cib-update=385, confirmed=true)
Feb 28 15:55:38 zs95kj crmd[121380]: notice: Operation zs95kjg110102_start_0: ok (node=zs95kjpcs1, call=44, rc=0, cib-update=387, confirmed=true)
[root@zs95kj ~]#

Once the remote node re-established its connection, there was no further remote node / resource instability.

Anyway, I am just wondering why there was no attempt to migrate this remote node guest instead of rebooting it. Is it necessary to reboot the guest in order for it to be managed by pacemaker and corosync over the ring1 interface once ring0 goes down? Is live guest migration even possible if ring0 goes away and ring1 takes over?
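One way I could probe that last question myself would be to leave ring0 down, explicitly move the guest, and check whether migrate_to / migrate_from operations actually run, along the lines of (zs93kjpcs1 here is just another node in this cluster):

[root@zs95kj]# pcs resource move zs95kjg110102_res zs93kjpcs1       # should trigger a live migration if allow-migrate is honored
[root@zs95kj]# grep -e migrate_to -e migrate_from /var/log/messages | grep zs95kjg110102
[root@zs95kj]# pcs resource clear zs95kjg110102_res                 # remove the location constraint created by the move

But I would still like to understand whether the stop/start recovery I saw above is the expected behavior for a lost remote-node connection.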
Thanks in advance..

Scott Greenlese ... KVM on System Z - Solutions Test, IBM Poughkeepsie, N.Y.
  INTERNET: swgreenl@us.ibm.com