Hi,

I am running a few corosync "passive mode" Redundant Ring Protocol (RRP) failure scenarios. My cluster has several remote-node VirtualDomain resources running on each node in the cluster, which have been configured to allow Live Guest Migration (LGM) operations.

While both corosync rings are active, if I drop ring0 on a given node where I have remote nodes (guests) running, I noticed that the guest is shut down and restarted on the same host, after which the connection is re-established and the guest proceeds to run on that same cluster node.

I am wondering why pacemaker doesn't try to live-migrate the remote node (guest) to a different node instead of rebooting the guest. Is there some way to configure the remote nodes so that the recovery action is LGM instead of a reboot when the host-to-remote-node connection is lost in an RRP situation? I guess the next question is: is it even possible to LGM a remote node guest if the corosync ring fails over from ring0 to ring1 (or vice versa)?

# For example, here's a remote node's VirtualDomain resource definition.

[root@zs95kj]# pcs resource show zs95kjg110102_res
 Resource: zs95kjg110102_res (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/guestxml/nfs1/zs95kjg110102.xml hypervisor=qemu:///system migration_transport=ssh
  Meta Attrs: allow-migrate=true remote-node=zs95kjg110102 remote-addr=10.20.110.102
  Operations: start interval=0s timeout=480 (zs95kjg110102_res-start-interval-0s)
              stop interval=0s timeout=120 (zs95kjg110102_res-stop-interval-0s)
              monitor interval=30s (zs95kjg110102_res-monitor-interval-30s)
              migrate-from interval=0s timeout=1200 (zs95kjg110102_res-migrate-from-interval-0s)
              migrate-to interval=0s timeout=1200 (zs95kjg110102_res-migrate-to-interval-0s)
[root@zs95kj VD]#
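
# For reference, this definition amounts to roughly the following pcs command (with the
# start/stop/migrate op timeouts as shown above); the guest becomes a remote node purely
# through the remote-node / remote-addr meta attributes:

[root@zs95kj]# pcs resource create zs95kjg110102_res ocf:heartbeat:VirtualDomain \
        config=/guestxml/nfs1/zs95kjg110102.xml hypervisor=qemu:///system migration_transport=ssh \
        meta allow-migrate=true remote-node=zs95kjg110102 remote-addr=10.20.110.102 \
        op monitor interval=30s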

# My RRP rings are active, and rrp_mode is configured as "passive".

[root@zs95kj ~]# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
        id      = 10.20.93.12
        status  = ring 0 active with no faults
RING ID 1
        id      = 10.20.94.212
        status  = ring 1 active with no faults


# Here's the corosync.conf ..

[root@zs95kj ~]# cat /etc/corosync/corosync.conf
totem {
    version: 2
    secauth: off
    cluster_name: test_cluster_2
    transport: udpu
    rrp_mode: passive
}

nodelist {
    node {
        ring0_addr: zs95kjpcs1
        ring1_addr: zs95kjpcs2
        nodeid: 2
    }

    node {
        ring0_addr: zs95KLpcs1
        ring1_addr: zs95KLpcs2
        nodeid: 3
    }

    node {
        ring0_addr: zs90kppcs1
        ring1_addr: zs90kppcs2
        nodeid: 4
    }

    node {
        ring0_addr: zs93KLpcs1
        ring1_addr: zs93KLpcs2
        nodeid: 5
    }

    node {
        ring0_addr: zs93kjpcs1
        ring1_addr: zs93kjpcs2
        nodeid: 1
    }
}

quorum {
    provider: corosync_votequorum
}

logging {
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
    timestamp: on
    syslog_facility: daemon
    to_syslog: yes
    debug: on

    logger_subsys {
        debug: off
        subsys: QUORUM
    }
}


# Here's the vlan / route situation on cluster node zs95kj:

ring0 is on vlan1293
ring1 is on vlan1294

[root@zs95kj ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.20.93.254    0.0.0.0         UG    400    0        0 vlan1293   << default route to guests from ring0
9.0.0.0         9.12.23.1       255.0.0.0       UG    400    0        0 vlan508
9.12.23.0       0.0.0.0         255.255.255.0   U     400    0        0 vlan508
10.20.92.0      0.0.0.0         255.255.255.0   U     400    0        0 vlan1292
10.20.93.0      0.0.0.0         255.255.255.0   U     0      0        0 vlan1293   << ring0 IPs
10.20.93.0      0.0.0.0         255.255.255.0   U     400    0        0 vlan1293
10.20.94.0      0.0.0.0         255.255.255.0   U     0      0        0 vlan1294   << ring1 IPs
10.20.94.0      0.0.0.0         255.255.255.0   U     400    0        0 vlan1294
10.20.101.0     0.0.0.0         255.255.255.0   U     400    0        0 vlan1298
10.20.109.0     10.20.94.254    255.255.255.0   UG    400    0        0 vlan1294   << Route to guests on 10.20.109 from ring1
10.20.110.0     10.20.94.254    255.255.255.0   UG    400    0        0 vlan1294   << Route to guests on 10.20.110 from ring1
169.254.0.0     0.0.0.0         255.255.0.0     U     1007   0        0 enccw0.0.02e0
169.254.0.0     0.0.0.0         255.255.0.0     U     1016   0        0 ovsbridge1
192.168.122.0   0.0.0.0         255.255.255.0   U     0      0        0 virbr0
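
# For what it's worth, the ring1 path to a given guest can be sanity-checked from the host
# with something like the following; it should show the 10.20.94.254 gateway on vlan1294
# being used:

[root@zs95kj ~]# ip route get 10.20.110.102
[root@zs95kj ~]# ping -I vlan1294 -c 3 10.20.110.102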

# On the remote node, you can see we have a connection back to the host:

Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted:     info: crm_log_init:  Changed active directory to /var/lib/heartbeat/cores/root
Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted:     info: qb_ipcs_us_publish:    server name: lrmd
Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted:   notice: lrmd_init_remote_tls_server:   Starting a tls listener on port 3121.
Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted:   notice: bind_and_listen:       Listening on address ::
Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted:     info: qb_ipcs_us_publish:    server name: cib_ro
Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted:     info: qb_ipcs_us_publish:    server name: cib_rw
Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted:     info: qb_ipcs_us_publish:    server name: cib_shm
Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted:     info: qb_ipcs_us_publish:    server name: attrd
Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted:     info: qb_ipcs_us_publish:    server name: stonith-ng
Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted:     info: qb_ipcs_us_publish:    server name: crmd
Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted:     info: main:  Starting
Feb 28 14:30:27 [928] zs95kjg110102 pacemaker_remoted:   notice: lrmd_remote_listen:    LRMD client connection established. 0x9ec18b50 id: 93e25ef0-4ff8-45ac-a6ed-f13b64588326

zs95kjg110102:~ # netstat -anp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      946/sshd
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN      1022/master
tcp        0      0 0.0.0.0:5666            0.0.0.0:*               LISTEN      931/xinetd
tcp        0      0 0.0.0.0:5801            0.0.0.0:*               LISTEN      931/xinetd
tcp        0      0 0.0.0.0:5901            0.0.0.0:*               LISTEN      931/xinetd
tcp        0      0 :::21                   :::*                    LISTEN      926/vsftpd
tcp        0      0 :::22                   :::*                    LISTEN      946/sshd
tcp        0      0 ::1:25                  :::*                    LISTEN      1022/master
tcp        0      0 :::44931                :::*                    LISTEN      1068/xdm
tcp        0      0 :::80                   :::*                    LISTEN      929/httpd-prefork
tcp        0      0 :::3121                 :::*                    LISTEN      928/pacemaker_remot
tcp        0      0 10.20.110.102:3121      10.20.93.12:46425       ESTABLISHED 928/pacemaker_remot   << connection from the host's ring0 address
udp        0      0 :::177                  :::*                                1068/xdm


## Drop the ring0 (vlan1293) interface on cluster node zs95kj, causing failover to ring1 (vlan1294):

[root@zs95kj]# date;ifdown vlan1293
Tue Feb 28 15:54:11 EST 2017
Device 'vlan1293' successfully disconnected.


## Confirm that ring0 is now offline (a.k.a. "FAULTY"):

[root@zs95kj]# date;corosync-cfgtool -s
Tue Feb 28 15:54:49 EST 2017
Printing ring status.
Local node ID 2
RING ID 0
        id      = 10.20.93.12
        status  = Marking ringid 0 interface 10.20.93.12 FAULTY
RING ID 1
        id      = 10.20.94.212
        status  = ring 1 active with no faults
[root@zs95kj VD]#
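
# For completeness: once the test is done, the faulty ring can be re-enabled by bringing the
# interface back up and clearing the fault state, e.g.:

[root@zs95kj]# ifup vlan1293
[root@zs95kj]# corosync-cfgtool -r

# ("corosync-cfgtool -r" re-enables redundant ring operation cluster-wide after a fault.)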

# See that the resource stayed local to cluster node zs95kj:

[root@zs95kj]# date;pcs resource show |grep zs95kjg110102
Tue Feb 28 15:55:32 EST 2017
 zs95kjg110102_res      (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
You have new mail in /var/spool/mail/root


# On the remote node, new entries in pacemaker.log show the connection being re-established:

Feb 28 15:55:17 [928] zs95kjg110102 pacemaker_remoted:   notice: crm_signal_dispatch:   Invoking handler for signal 15: Terminated
Feb 28 15:55:17 [928] zs95kjg110102 pacemaker_remoted:     info: lrmd_shutdown: Terminating with  1 clients
Feb 28 15:55:17 [928] zs95kjg110102 pacemaker_remoted:     info: qb_ipcs_us_withdraw:   withdrawing server sockets
Feb 28 15:55:17 [928] zs95kjg110102 pacemaker_remoted:     info: crm_xml_cleanup:       Cleaning up memory from libxml2
Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted:     info: crm_log_init:  Changed active directory to /var/lib/heartbeat/cores/root
Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted:     info: qb_ipcs_us_publish:    server name: lrmd
Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted:   notice: lrmd_init_remote_tls_server:   Starting a tls listener on port 3121.
Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted:   notice: bind_and_listen:       Listening on address ::
Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted:     info: qb_ipcs_us_publish:    server name: cib_ro
Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted:     info: qb_ipcs_us_publish:    server name: cib_rw
Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted:     info: qb_ipcs_us_publish:    server name: cib_shm
Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted:     info: qb_ipcs_us_publish:    server name: attrd
Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted:     info: qb_ipcs_us_publish:    server name: stonith-ng
Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted:     info: qb_ipcs_us_publish:    server name: crmd
Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted:     info: main:  Starting
Feb 28 15:55:38 [942] zs95kjg110102 pacemaker_remoted:   notice: lrmd_remote_listen:    LRMD client connection established. 0xbed1ab50 id: b19ed532-6f61-4d9c-9439-ffb836eea34f
zs95kjg110102:~ #

zs95kjg110102:~ # netstat -anp |less
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      961/sshd
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN      1065/master
tcp        0      0 0.0.0.0:5666            0.0.0.0:*               LISTEN      946/xinetd
tcp        0      0 0.0.0.0:5801            0.0.0.0:*               LISTEN      946/xinetd
tcp        0      0 0.0.0.0:5901            0.0.0.0:*               LISTEN      946/xinetd
tcp        0      0 10.20.110.102:22        10.20.94.32:57749       ESTABLISHED 1134/0
tcp        0      0 :::21                   :::*                    LISTEN      941/vsftpd
tcp        0      0 :::22                   :::*                    LISTEN      961/sshd
tcp        0      0 ::1:25                  :::*                    LISTEN      1065/master
tcp        0      0 :::80                   :::*                    LISTEN      944/httpd-prefork
tcp        0      0 :::3121                 :::*                    LISTEN      942/pacemaker_remot
tcp        0      0 :::34836                :::*                    LISTEN      1070/xdm
tcp        0      0 10.20.110.102:3121      10.20.94.212:49666      ESTABLISHED 942/pacemaker_remot   << connection now from the host's ring1 address
udp        0      0 :::177                  :::*                                1070/xdm


## On the host node zs95kj, the system messages show the remote node (guest) being shut down and started again, with no attempt at LGM:

[root@zs95kj ~]# grep "Feb 28" /var/log/messages |grep zs95kjg110102
Feb 28 15:55:07 zs95kj crmd[121380]:   error: Operation zs95kjg110102_monitor_30000: Timed Out (node=zs95kjpcs1, call=2, timeout=30000ms)
Feb 28 15:55:07 zs95kj crmd[121380]:   error: Unexpected disconnect on remote-node zs95kjg110102
Feb 28 15:55:17 zs95kj crmd[121380]:  notice: Operation zs95kjg110102_stop_0: ok (node=zs95kjpcs1, call=38, rc=0, cib-update=370, confirmed=true)
Feb 28 15:55:17 zs95kj attrd[121378]:  notice: Removing all zs95kjg110102 attributes for zs95kjpcs1
Feb 28 15:55:17 zs95kj VirtualDomain(zs95kjg110102_res)[173127]: INFO: Issuing graceful shutdown request for domain zs95kjg110102.
Feb 28 15:55:23 zs95kj systemd-machined: Machine qemu-38-zs95kjg110102 terminated.
Feb 28 15:55:23 zs95kj crmd[121380]:  notice: Operation zs95kjg110102_res_stop_0: ok (node=zs95kjpcs1, call=858, rc=0, cib-update=378, confirmed=true)
Feb 28 15:55:24 zs95kj systemd-machined: New machine qemu-64-zs95kjg110102.
Feb 28 15:55:24 zs95kj systemd: Started Virtual Machine qemu-64-zs95kjg110102.
Feb 28 15:55:24 zs95kj systemd: Starting Virtual Machine qemu-64-zs95kjg110102.
Feb 28 15:55:25 zs95kj crmd[121380]:  notice: Operation zs95kjg110102_res_start_0: ok (node=zs95kjpcs1, call=859, rc=0, cib-update=385, confirmed=true)
Feb 28 15:55:38 zs95kj crmd[121380]:  notice: Operation zs95kjg110102_start_0: ok (node=zs95kjpcs1, call=44, rc=0, cib-update=387, confirmed=true)
[root@zs95kj ~]#


Once the remote node re-established its connection, there was no further remote node / resource instability.

Anyway, I am just wondering why there was no attempt to live-migrate this remote node guest as opposed to rebooting it.
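
By contrast, an explicit move of the resource to another cluster node, e.g.

[root@zs95kj]# pcs resource move zs95kjg110102_res zs93kjpcs1

should (as I understand it) be carried out as a live migration rather than a stop/start, since allow-migrate=true is set. That is the recovery behavior I was expecting when the connection dropped.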
Is it necessary to reboot the guest in order for it to be managed by pacemaker and corosync over the ring1 interface if ring0 goes down? Is live guest migration even possible if ring0 goes away and ring1 takes over?

Thanks in advance.

Scott Greenlese ... KVM on System Z - Solutions Test, IBM Poughkeepsie, N.Y.
  INTERNET:  swgreenl@us.ibm.com