[ClusterLabs] failed migration handled the wrong way
Andrei Borzenkov
arvidjaar at gmail.com
Mon Feb 1 05:05:34 EST 2021
On Mon, Feb 1, 2021 at 12:53 PM Ulrich Windl
<Ulrich.Windl at rz.uni-regensburg.de> wrote:
>
> Hi!
>
> While fighting with the wrong configuration, I broke libvirt live-migration by not enabling the TLS socket.
>
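Since the missing piece here was the TLS listener: with the
socket-activated libvirtd that SLES 15 SP2 ships, the TLS endpoint is
a separate systemd unit. A rough sketch, to be adjusted to how
libvirtd is actually started on your hosts:

    # on both hosts, expose the TLS listener
    systemctl enable --now libvirtd-tls.socket

    # for a traditionally started libvirtd instead:
    #   /etc/libvirt/libvirtd.conf:   listen_tls = 1
    #   /etc/sysconfig/libvirtd:      LIBVIRTD_ARGS="--listen"

The xen+tls:// URI in the agent's log below presumably comes from a
migration_transport=tls resource parameter, so the libvirt listener
and the agent's transport have to agree on both nodes.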
> When testing live migration of a VM from h16 to h18, these are the essential events:
> Feb 01 10:30:10 h16 pacemaker-schedulerd[7466]: notice: * Move prm_cron_snap_test-jeos5 ( h16 -> h18 )
> Feb 01 10:30:10 h16 pacemaker-schedulerd[7466]: notice: * Migrate prm_xen_test-jeos5 ( h16 -> h18 )
> Feb 01 10:30:10 h16 pacemaker-execd[7464]: notice: executing - rsc:prm_cron_snap_test-jeos5 action:stop call_id:227
> Feb 01 10:30:10 h16 pacemaker-controld[7467]: notice: Result of stop operation for prm_cron_snap_test-jeos5 on h16: ok
> Feb 01 10:30:10 h16 pacemaker-controld[7467]: notice: Initiating migrate_to operation prm_xen_test-jeos5_migrate_to_0 locally on h16
> Feb 01 10:30:10 h16 pacemaker-execd[7464]: notice: executing - rsc:prm_xen_test-jeos5 action:migrate_to call_id:229
> Feb 01 10:30:10 h16 VirtualDomain(prm_xen_test-jeos5)[33121]: INFO: test-jeos5: Starting live migration to h18 (using: virsh --connect=xen:///system --quiet migrate --live test-jeos5 xen+tls://h18.sap.klinik.uni-regensburg.de ).
> Feb 01 10:30:11 h16 libvirtd[28145]: internal error: Failed to send migration data to destination host
> Feb 01 10:30:11 h16 VirtualDomain(prm_xen_test-jeos5)[33132]: ERROR: test-jeos5: live migration to h18 failed: 1
> Feb 01 10:30:11 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5_migrate_to_0[33079] error output [ error: internal error: Failed to send migration data to destination host ]
> Feb 01 10:30:11 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5_migrate_to_0[33079] error output [ ocf-exit-reason:test-jeos5: live migration to h18 failed: 1 ]
> Feb 01 10:30:11 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5 migrate_to (call 229, PID 33079) exited with status 1 (execution time 693ms, queue time 0ms)
> Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: Result of migrate_to operation for prm_xen_test-jeos5 on h16: error
> Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: h16-prm_xen_test-jeos5_migrate_to_0:229 [ error: internal error: Failed to send migration data to destination host\nocf-exit-reason:test-jeos5: live migration to h18 failed: 1\n ]
> Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: Transition 743 aborted by operation prm_xen_test-jeos5_migrate_to_0 'modify' on h16: Event failed
> Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: Transition 743 action 120 (prm_xen_test-jeos5_migrate_to_0 on h16): expected 'ok' but got 'error'
> Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: Transition 743 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=7, Source=/var/lib/pacemaker/pengine/pe-input-84.bz2): Complete
> Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: warning: Unexpected result (error: test-jeos5: live migration to h18 failed: 1) was recorded for migrate_to of prm_xen_test-jeos5 on h16 at Feb 1 10:30:10 2021
> Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: warning: Unexpected result (error: test-jeos5: live migration to h18 failed: 1) was recorded for migrate_to of prm_xen_test-jeos5 on h16 at Feb 1 10:30:10 2021
> Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: error: Resource prm_xen_test-jeos5 is active on 2 nodes (attempting recovery)
> Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: notice: * Start prm_cron_snap_test-jeos5 ( h18 )
> Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: notice: * Recover prm_xen_test-jeos5 ( h18 )
> Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: error: Calculated transition 744 (with errors), saving inputs in /var/lib/pacemaker/pengine/pe-error-6.bz2
> Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: Initiating stop operation prm_xen_test-jeos5_stop_0 on h18
> Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: Initiating stop operation prm_xen_test-jeos5_stop_0 locally on h16
> Feb 01 10:30:11 h16 pacemaker-execd[7464]: notice: executing - rsc:prm_xen_test-jeos5 action:stop call_id:230
> Feb 01 10:30:11 h16 VirtualDomain(prm_xen_test-jeos5)[33170]: INFO: Issuing graceful shutdown request for domain test-jeos5.
> Feb 01 10:31:11 h16 libvirtd[28145]: internal error: Failed to shutdown domain '13' with libxenlight
> Feb 01 10:33:06 h16 VirtualDomain(prm_xen_test-jeos5)[34194]: INFO: Issuing forced shutdown (destroy) request for domain test-jeos5.
> Feb 01 10:33:08 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5_stop_0[33137] error output [ error: Failed to shutdown domain test-jeos5 ]
> Feb 01 10:33:08 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5_stop_0[33137] error output [ error: internal error: Failed to shutdown domain '13' with libxenlight ]
> Feb 01 10:33:08 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5_stop_0[33137] error output [ ]
> Feb 01 10:33:08 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5 stop (call 230, PID 33137) exited with status 0 (execution time 177112ms, queue time 0ms)
>
> ### Shouldn't the result be an error?
>
If the domain remained active, I would say yes. But do not forget that
a failure to stop a resource will, by default, get the node fenced.
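A minimal sketch of where that policy lives (crm shell syntax; the
values shown are the defaults with stonith enabled, not something you
need to add):

    # a failed stop escalates to fencing unless overridden per operation:
    #   op stop interval=0 timeout=180s on-fail=fence
    crm configure show | grep -i stonith-enabled

    # to check whether the forced stop really removed the domain:
    virsh --connect xen:///system domstate test-jeos5

Since the forced destroy apparently did remove the domain, returning
OK for the stop is what I would expect.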
> Fortunately, locking prevented a duplicate activation on h18:
> Feb 01 10:32:51 h18 systemd[1]: Started Virtualization daemon.
> Feb 01 10:32:52 h18 virtlockd[9904]: Requested operation is not valid: Lockspace for path /var/lib/libvirt/lockd/files already exists
> Feb 01 10:32:52 h18 virtlockd[9904]: Requested operation is not valid: Lockspace for path /var/lib/libvirt/lockd/lvmvolumes already exists
> Feb 01 10:32:52 h18 virtlockd[9904]: Requested operation is not valid: Lockspace for path /var/lib/libvirt/lockd/scsivolumes already exists
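Those lockspaces come from libvirt's lockd plugin. For reference, on a
Xen host it is wired up roughly like this (file names as libvirt ships
them; the lockspace paths match your log, adjust the rest to your
installation):

    /etc/libvirt/libxl.conf:
      lock_manager = "lockd"

    /etc/libvirt/libxl-lockd.conf:
      require_lease_for_disks = 1
      file_lockspace_dir = "/var/lib/libvirt/lockd/files"
      lvm_lockspace_dir  = "/var/lib/libvirt/lockd/lvmvolumes"
      scsi_lockspace_dir = "/var/lib/libvirt/lockd/scsivolumes"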
>
> So the main issue seems to be that a failed forced stop returned "success", causing a "recover" on h18 while the VM was still running on h16.
No, "recover" was caused by failure to migrate. You told pacemaker
that you now want this VM on another host, and your wish was its
command - it attempted to fulfill it. It obviously needed to stop VM
on its current host before trying to (re-)start on a new home.
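If you want to see exactly how the scheduler arrived at the stop on
both nodes plus the restart on h18, you can replay the input it saved
(path taken from your log above):

    crm_simulate -S -x /var/lib/pacemaker/pengine/pe-error-6.bz2

And once the TLS socket is fixed, clearing the failed migrate_to lets
the next migration be attempted cleanly:

    crm_resource --cleanup --resource prm_xen_test-jeos5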
>
> h16:~ # rpm -qf /usr/lib/ocf/resource.d/heartbeat/VirtualDomain
> resource-agents-4.4.0+git57.70549516-3.12.1.x86_64
>
> (SLES15 SP2)
>
> Regards,
> Ulrich
>
>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/