[ClusterLabs] failed migration handled the wrong way
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Mon Feb 1 04:53:17 EST 2021
Hi!
While fighting with a wrong configuration, I broke libvirt live migration by not enabling the TLS socket.
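For reference, this is roughly what was missing — a sketch only, assuming libvirt with systemd socket activation as shipped on SLES15 SP2 (with the classic setup, `listen_tls = 1` in /etc/libvirt/libvirtd.conf plus starting libvirtd with `--listen` is the equivalent; valid CA/server/client certificates are also required on both hosts):

```shell
# Assumption: socket-activated libvirtd (libvirt >= 5.6).
# Enable the TLS listener that xen+tls:// migration connects to,
# on both the source and the destination host:
systemctl enable --now libvirtd-tls.socket
```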
When testing a live migration of a VM from h16 to h18, these were the essential events:
Feb 01 10:30:10 h16 pacemaker-schedulerd[7466]: notice: * Move prm_cron_snap_test-jeos5 ( h16 -> h18 )
Feb 01 10:30:10 h16 pacemaker-schedulerd[7466]: notice: * Migrate prm_xen_test-jeos5 ( h16 -> h18 )
Feb 01 10:30:10 h16 pacemaker-execd[7464]: notice: executing - rsc:prm_cron_snap_test-jeos5 action:stop call_id:227
Feb 01 10:30:10 h16 pacemaker-controld[7467]: notice: Result of stop operation for prm_cron_snap_test-jeos5 on h16: ok
Feb 01 10:30:10 h16 pacemaker-controld[7467]: notice: Initiating migrate_to operation prm_xen_test-jeos5_migrate_to_0 locally on h16
Feb 01 10:30:10 h16 pacemaker-execd[7464]: notice: executing - rsc:prm_xen_test-jeos5 action:migrate_to call_id:229
Feb 01 10:30:10 h16 VirtualDomain(prm_xen_test-jeos5)[33121]: INFO: test-jeos5: Starting live migration to h18 (using: virsh --connect=xen:///system --quiet migrate --live test-jeos5 xen+tls://h18.sap.klinik.uni-regensburg.de).
Feb 01 10:30:11 h16 libvirtd[28145]: internal error: Failed to send migration data to destination host
Feb 01 10:30:11 h16 VirtualDomain(prm_xen_test-jeos5)[33132]: ERROR: test-jeos5: live migration to h18 failed: 1
Feb 01 10:30:11 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5_migrate_to_0[33079] error output [ error: internal error: Failed to send migration data to destination host ]
Feb 01 10:30:11 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5_migrate_to_0[33079] error output [ ocf-exit-reason:test-jeos5: live migration to h18 failed: 1 ]
Feb 01 10:30:11 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5 migrate_to (call 229, PID 33079) exited with status 1 (execution time 693ms, queue time 0ms)
Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: Result of migrate_to operation for prm_xen_test-jeos5 on h16: error
Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: h16-prm_xen_test-jeos5_migrate_to_0:229 [ error: internal error: Failed to send migration data to destination host\nocf-exit-reason:test-jeos5: live migration to h18 failed: 1\n ]
Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: Transition 743 aborted by operation prm_xen_test-jeos5_migrate_to_0 'modify' on h16: Event failed
Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: Transition 743 action 120 (prm_xen_test-jeos5_migrate_to_0 on h16): expected 'ok' but got 'error'
Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: Transition 743 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=7, Source=/var/lib/pacemaker/pengine/pe-input-84.bz2): Complete
Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: warning: Unexpected result (error: test-jeos5: live migration to h18 failed: 1) was recorded for migrate_to of prm_xen_test-jeos5 on h16 at Feb 1 10:30:10 2021
Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: warning: Unexpected result (error: test-jeos5: live migration to h18 failed: 1) was recorded for migrate_to of prm_xen_test-jeos5 on h16 at Feb 1 10:30:10 2021
Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: error: Resource prm_xen_test-jeos5 is active on 2 nodes (attempting recovery)
Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: notice: * Start prm_cron_snap_test-jeos5 ( h18 )
Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: notice: * Recover prm_xen_test-jeos5 ( h18 )
Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: error: Calculated transition 744 (with errors), saving inputs in /var/lib/pacemaker/pengine/pe-error-6.bz2
Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: Initiating stop operation prm_xen_test-jeos5_stop_0 on h18
Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: Initiating stop operation prm_xen_test-jeos5_stop_0 locally on h16
Feb 01 10:30:11 h16 pacemaker-execd[7464]: notice: executing - rsc:prm_xen_test-jeos5 action:stop call_id:230
Feb 01 10:30:11 h16 VirtualDomain(prm_xen_test-jeos5)[33170]: INFO: Issuing graceful shutdown request for domain test-jeos5.
Feb 01 10:31:11 h16 libvirtd[28145]: internal error: Failed to shutdown domain '13' with libxenlight
Feb 01 10:33:06 h16 VirtualDomain(prm_xen_test-jeos5)[34194]: INFO: Issuing forced shutdown (destroy) request for domain test-jeos5.
Feb 01 10:33:08 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5_stop_0[33137] error output [ error: Failed to shutdown domain test-jeos5 ]
Feb 01 10:33:08 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5_stop_0[33137] error output [ error: internal error: Failed to shutdown domain '13' with libxenlight ]
Feb 01 10:33:08 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5_stop_0[33137] error output [ ]
Feb 01 10:33:08 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5 stop (call 230, PID 33137) exited with status 0 (execution time 177112ms, queue time 0ms)
### Shouldn't the result be an error?
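The masking is easy to illustrate with a minimal sketch. This is not the agent's actual code — `stop_domain` and the `false`/`true` stand-ins for `virsh shutdown`/`virsh destroy` are hypothetical — but it shows the pattern: when the forced destroy is the last command executed, its exit status becomes the action result, and the earlier graceful-shutdown error disappears.

```shell
# Hypothetical sketch of a graceful-then-forced stop sequence (NOT the
# VirtualDomain agent's real code): only the last command's exit status
# is returned, so a failed graceful shutdown is masked when the
# subsequent destroy succeeds.
stop_domain() {
  graceful=$1   # stand-in for "virsh shutdown <domain>"
  forced=$2     # stand-in for "virsh destroy <domain>"
  if "$graceful"; then
    return 0
  fi
  # Graceful path failed (cf. "Failed to shutdown domain '13' with
  # libxenlight" above); fall back to a forced destroy.
  "$forced"     # this rc becomes the function's (and action's) result
}

stop_domain false true
echo "stop rc: $?"    # prints "stop rc: 0" although the graceful stop errored
```

Whether that should count as a clean stop is exactly the question: the domain is gone in the end, so OCF-wise "success" is defensible, but the intermediate errors never reach the cluster.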
Fortunately, locking prevented duplicate activation on h18:
Feb 01 10:32:51 h18 systemd[1]: Started Virtualization daemon.
Feb 01 10:32:52 h18 virtlockd[9904]: Requested operation is not valid: Lockspace for path /var/lib/libvirt/lockd/files already exists
Feb 01 10:32:52 h18 virtlockd[9904]: Requested operation is not valid: Lockspace for path /var/lib/libvirt/lockd/lvmvolumes already exists
Feb 01 10:32:52 h18 virtlockd[9904]: Requested operation is not valid: Lockspace for path /var/lib/libvirt/lockd/scsivolumes already exists
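For anyone without this safety net: the protection comes from libvirt's virtlockd-based lock manager. A configuration sketch, assuming the libxl driver — the file names and the option names here are from my reading of the libvirt lockd documentation, so treat them as assumptions to verify against your installed config files:

```
# /etc/libvirt/libxl.conf -- enable the virtlockd-based lock manager
lock_manager = "lockd"

# /etc/libvirt/libxl-lockd.conf -- lockspace directories matching the
# /var/lib/libvirt/lockd/{files,lvmvolumes,scsivolumes} paths in the log
auto_disk_leases = 1
file_lockspace_dir = "/var/lib/libvirt/lockd/files"
lvm_lockspace_dir = "/var/lib/libvirt/lockd/lvmvolumes"
scsi_lockspace_dir = "/var/lib/libvirt/lockd/scsivolumes"
```

The lockspace directories must live on storage shared by all hosts for the leases to protect against cross-host double activation.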
So the main issue seems to be that a failed forced stop returned "success", causing a "Recover" on h18 while the VM was still running on h16.
h16:~ # rpm -qf /usr/lib/ocf/resource.d/heartbeat/VirtualDomain
resource-agents-4.4.0+git57.70549516-3.12.1.x86_64
(SLES15 SP2)
Regards,
Ulrich