[ClusterLabs] failed migration handled the wrong way
Andrei Borzenkov
arvidjaar at gmail.com
Mon Feb 1 05:05:34 EST 2021
On Mon, Feb 1, 2021 at 12:53 PM Ulrich Windl
<Ulrich.Windl at rz.uni-regensburg.de> wrote:
>
> Hi!
>
> While fighting with the wrong configuration, I broke libvirt live-migration by not enabling the TLS socket.
>
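Since the missing piece here was the TLS listener: with the
socket-activated libvirtd that SLES 15 SP2 ships, the TLS endpoint is
a separate systemd unit. A rough sketch, to be adjusted to how
libvirtd is actually started on your hosts:

    # on both hosts, expose the TLS listener
    systemctl enable --now libvirtd-tls.socket

    # for a traditionally started libvirtd instead:
    #   /etc/libvirt/libvirtd.conf:   listen_tls = 1
    #   /etc/sysconfig/libvirtd:      LIBVIRTD_ARGS="--listen"

The xen+tls:// URI in the agent's log below presumably comes from a
migration_transport=tls resource parameter, so the libvirt listener
and the agent's transport have to agree on both nodes.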
> When testing live migration of a VM from h16 to h18, these are the essential events:
> Feb 01 10:30:10 h16 pacemaker-schedulerd[7466]: notice: * Move prm_cron_snap_test-jeos5 ( h16 -> h18 )
> Feb 01 10:30:10 h16 pacemaker-schedulerd[7466]: notice: * Migrate prm_xen_test-jeos5 ( h16 -> h18 )
> Feb 01 10:30:10 h16 pacemaker-execd[7464]: notice: executing - rsc:prm_cron_snap_test-jeos5 action:stop call_id:227
> Feb 01 10:30:10 h16 pacemaker-controld[7467]: notice: Result of stop operation for prm_cron_snap_test-jeos5 on h16: ok
> Feb 01 10:30:10 h16 pacemaker-controld[7467]: notice: Initiating migrate_to operation prm_xen_test-jeos5_migrate_to_0 locally on h16
> Feb 01 10:30:10 h16 pacemaker-execd[7464]: notice: executing - rsc:prm_xen_test-jeos5 action:migrate_to call_id:229
> Feb 01 10:30:10 h16 VirtualDomain(prm_xen_test-jeos5)[33121]: INFO: test-jeos5: Starting live migration to h18 (using: virsh --connect=xen:///system --quiet migrate --live test-jeos5 xen+tls://h18.sap.klinik.uni-regensburg.de ).
> Feb 01 10:30:11 h16 libvirtd[28145]: internal error: Failed to send migration data to destination host
> Feb 01 10:30:11 h16 VirtualDomain(prm_xen_test-jeos5)[33132]: ERROR: test-jeos5: live migration to h18 failed: 1
> Feb 01 10:30:11 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5_migrate_to_0[33079] error output [ error: internal error: Failed to send migration data to destination host ]
> Feb 01 10:30:11 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5_migrate_to_0[33079] error output [ ocf-exit-reason:test-jeos5: live migration to h18 failed: 1 ]
> Feb 01 10:30:11 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5 migrate_to (call 229, PID 33079) exited with status 1 (execution time 693ms, queue time 0ms)
> Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: Result of migrate_to operation for prm_xen_test-jeos5 on h16: error
> Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: h16-prm_xen_test-jeos5_migrate_to_0:229 [ error: internal error: Failed to send migration data to destination host\nocf-exit-reason:test-jeos5: live migration to h18 failed: 1\n ]
> Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: Transition 743 aborted by operation prm_xen_test-jeos5_migrate_to_0 'modify' on h16: Event failed
> Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: Transition 743 action 120 (prm_xen_test-jeos5_migrate_to_0 on h16): expected 'ok' but got 'error'
> Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: Transition 743 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=7, Source=/var/lib/pacemaker/pengine/pe-input-84.bz2): Complete
> Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: warning: Unexpected result (error: test-jeos5: live migration to h18 failed: 1) was recorded for migrate_to of prm_xen_test-jeos5 on h16 at Feb 1 10:30:10 2021
> Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: warning: Unexpected result (error: test-jeos5: live migration to h18 failed: 1) was recorded for migrate_to of prm_xen_test-jeos5 on h16 at Feb 1 10:30:10 2021
> Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: error: Resource prm_xen_test-jeos5 is active on 2 nodes (attempting recovery)
> Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: notice: * Start prm_cron_snap_test-jeos5 ( h18 )
> Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: notice: * Recover prm_xen_test-jeos5 ( h18 )
> Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: error: Calculated transition 744 (with errors), saving inputs in /var/lib/pacemaker/pengine/pe-error-6.bz2
> Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: Initiating stop operation prm_xen_test-jeos5_stop_0 on h18
> Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: Initiating stop operation prm_xen_test-jeos5_stop_0 locally on h16
> Feb 01 10:30:11 h16 pacemaker-execd[7464]: notice: executing - rsc:prm_xen_test-jeos5 action:stop call_id:230
> Feb 01 10:30:11 h16 VirtualDomain(prm_xen_test-jeos5)[33170]: INFO: Issuing graceful shutdown request for domain test-jeos5.
> Feb 01 10:31:11 h16 libvirtd[28145]: internal error: Failed to shutdown domain '13' with libxenlight
> Feb 01 10:33:06 h16 VirtualDomain(prm_xen_test-jeos5)[34194]: INFO: Issuing forced shutdown (destroy) request for domain test-jeos5.
> Feb 01 10:33:08 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5_stop_0[33137] error output [ error: Failed to shutdown domain test-jeos5 ]
> Feb 01 10:33:08 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5_stop_0[33137] error output [ error: internal error: Failed to shutdown domain '13' with libxenlight ]
> Feb 01 10:33:08 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5_stop_0[33137] error output [ ]
> Feb 01 10:33:08 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5 stop (call 230, PID 33137) exited with status 0 (execution time 177112ms, queue time 0ms)
>
> ### Shouldn't the result be an error?
>
If the domain remained active, I would say yes. But do not forget that
a failure to stop a resource will, by default, get the node fenced.
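A minimal sketch of where that policy lives (crm shell syntax; the
values shown are the defaults with stonith enabled, not something you
need to add):

    # a failed stop escalates to fencing unless overridden per operation:
    #   op stop interval=0 timeout=180s on-fail=fence
    crm configure show | grep -i stonith-enabled

    # to check whether the forced stop really removed the domain:
    virsh --connect xen:///system domstate test-jeos5

Since the forced destroy apparently did remove the domain, returning
OK for the stop is what I would expect.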
> Fortunately, locking prevented a duplicate activation on h18:
> Feb 01 10:32:51 h18 systemd[1]: Started Virtualization daemon.
> Feb 01 10:32:52 h18 virtlockd[9904]: Requested operation is not valid: Lockspace for path /var/lib/libvirt/lockd/files already exists
> Feb 01 10:32:52 h18 virtlockd[9904]: Requested operation is not valid: Lockspace for path /var/lib/libvirt/lockd/lvmvolumes already exists
> Feb 01 10:32:52 h18 virtlockd[9904]: Requested operation is not valid: Lockspace for path /var/lib/libvirt/lockd/scsivolumes already exists
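Those lockspaces come from libvirt's lockd plugin. For reference, on a
Xen host it is wired up roughly like this (file names as libvirt ships
them; the lockspace paths match your log, adjust the rest to your
installation):

    /etc/libvirt/libxl.conf:
      lock_manager = "lockd"

    /etc/libvirt/libxl-lockd.conf:
      require_lease_for_disks = 1
      file_lockspace_dir = "/var/lib/libvirt/lockd/files"
      lvm_lockspace_dir  = "/var/lib/libvirt/lockd/lvmvolumes"
      scsi_lockspace_dir = "/var/lib/libvirt/lockd/scsivolumes"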
>
> So the main issue seems to be that a failed forced stop returned "success", causing a "recover" on h18 while the VM was still running on h16.
No, "recover" was caused by failure to migrate. You told pacemaker
that you now want this VM on another host, and your wish was its
command - it attempted to fulfill it. It obviously needed to stop VM
on its current host before trying to (re-)start on a new home.
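If you want to see exactly how the scheduler arrived at the stop on
both nodes plus the restart on h18, you can replay the input it saved
(path taken from your log above):

    crm_simulate -S -x /var/lib/pacemaker/pengine/pe-error-6.bz2

And once the TLS socket is fixed, clearing the failed migrate_to lets
the next migration be attempted cleanly:

    crm_resource --cleanup --resource prm_xen_test-jeos5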
>
> h16:~ # rpm -qf /usr/lib/ocf/resource.d/heartbeat/VirtualDomain
> resource-agents-4.4.0+git57.70549516-3.12.1.x86_64
>
> (SLES15 SP2)
>
> Regards,
> Ulrich
>
>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/