[ClusterLabs] failed migration handled the wrong way
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Mon Feb 1 04:53:17 EST 2021
Hi!
While fighting with a wrong configuration, I broke libvirt live migration by not enabling the TLS socket.
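For reference, this is roughly what was missing — a sketch only, assuming libvirt with systemd socket activation as shipped on SLES15 SP2 (with the classic setup, `listen_tls = 1` in /etc/libvirt/libvirtd.conf plus starting libvirtd with `--listen` is the equivalent; valid CA/server/client certificates are also required on both hosts):

```shell
# Assumption: socket-activated libvirtd (libvirt >= 5.6).
# Enable the TLS listener that xen+tls:// migration connects to,
# on both the source and the destination host:
systemctl enable --now libvirtd-tls.socket
```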
When testing a live migration of a VM from h16 to h18, these were the essential events:
Feb 01 10:30:10 h16 pacemaker-schedulerd[7466]: notice: * Move prm_cron_snap_test-jeos5 ( h16 -> h18 )
Feb 01 10:30:10 h16 pacemaker-schedulerd[7466]: notice: * Migrate prm_xen_test-jeos5 ( h16 -> h18 )
Feb 01 10:30:10 h16 pacemaker-execd[7464]: notice: executing - rsc:prm_cron_snap_test-jeos5 action:stop call_id:227
Feb 01 10:30:10 h16 pacemaker-controld[7467]: notice: Result of stop operation for prm_cron_snap_test-jeos5 on h16: ok
Feb 01 10:30:10 h16 pacemaker-controld[7467]: notice: Initiating migrate_to operation prm_xen_test-jeos5_migrate_to_0 locally on h16
Feb 01 10:30:10 h16 pacemaker-execd[7464]: notice: executing - rsc:prm_xen_test-jeos5 action:migrate_to call_id:229
Feb 01 10:30:10 h16 VirtualDomain(prm_xen_test-jeos5)[33121]: INFO: test-jeos5: Starting live migration to h18 (using: virsh --connect=xen:///system --quiet migrate --live test-jeos5 xen+tls://h18.sap.klinik.uni-regensburg.de).
Feb 01 10:30:11 h16 libvirtd[28145]: internal error: Failed to send migration data to destination host
Feb 01 10:30:11 h16 VirtualDomain(prm_xen_test-jeos5)[33132]: ERROR: test-jeos5: live migration to h18 failed: 1
Feb 01 10:30:11 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5_migrate_to_0[33079] error output [ error: internal error: Failed to send migration data to destination host ]
Feb 01 10:30:11 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5_migrate_to_0[33079] error output [ ocf-exit-reason:test-jeos5: live migration to h18 failed: 1 ]
Feb 01 10:30:11 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5 migrate_to (call 229, PID 33079) exited with status 1 (execution time 693ms, queue time 0ms)
Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: Result of migrate_to operation for prm_xen_test-jeos5 on h16: error
Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: h16-prm_xen_test-jeos5_migrate_to_0:229 [ error: internal error: Failed to send migration data to destination host\nocf-exit-reason:test-jeos5: live migration to h18 failed: 1\n ]
Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: Transition 743 aborted by operation prm_xen_test-jeos5_migrate_to_0 'modify' on h16: Event failed
Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: Transition 743 action 120 (prm_xen_test-jeos5_migrate_to_0 on h16): expected 'ok' but got 'error'
Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: Transition 743 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=7, Source=/var/lib/pacemaker/pengine/pe-input-84.bz2): Complete
Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: warning: Unexpected result (error: test-jeos5: live migration to h18 failed: 1) was recorded for migrate_to of prm_xen_test-jeos5 on h16 at Feb 1 10:30:10 2021
Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: warning: Unexpected result (error: test-jeos5: live migration to h18 failed: 1) was recorded for migrate_to of prm_xen_test-jeos5 on h16 at Feb 1 10:30:10 2021
Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: error: Resource prm_xen_test-jeos5 is active on 2 nodes (attempting recovery)
Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: notice: * Start prm_cron_snap_test-jeos5 ( h18 )
Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: notice: * Recover prm_xen_test-jeos5 ( h18 )
Feb 01 10:30:11 h16 pacemaker-schedulerd[7466]: error: Calculated transition 744 (with errors), saving inputs in /var/lib/pacemaker/pengine/pe-error-6.bz2
Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: Initiating stop operation prm_xen_test-jeos5_stop_0 on h18
Feb 01 10:30:11 h16 pacemaker-controld[7467]: notice: Initiating stop operation prm_xen_test-jeos5_stop_0 locally on h16
Feb 01 10:30:11 h16 pacemaker-execd[7464]: notice: executing - rsc:prm_xen_test-jeos5 action:stop call_id:230
Feb 01 10:30:11 h16 VirtualDomain(prm_xen_test-jeos5)[33170]: INFO: Issuing graceful shutdown request for domain test-jeos5.
Feb 01 10:31:11 h16 libvirtd[28145]: internal error: Failed to shutdown domain '13' with libxenlight
Feb 01 10:33:06 h16 VirtualDomain(prm_xen_test-jeos5)[34194]: INFO: Issuing forced shutdown (destroy) request for domain test-jeos5.
Feb 01 10:33:08 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5_stop_0[33137] error output [ error: Failed to shutdown domain test-jeos5 ]
Feb 01 10:33:08 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5_stop_0[33137] error output [ error: internal error: Failed to shutdown domain '13' with libxenlight ]
Feb 01 10:33:08 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5_stop_0[33137] error output [ ]
Feb 01 10:33:08 h16 pacemaker-execd[7464]: notice: prm_xen_test-jeos5 stop (call 230, PID 33137) exited with status 0 (execution time 177112ms, queue time 0ms)
### Shouldn't the result be an error?
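The masking is easy to illustrate with a minimal sketch. This is not the agent's actual code — `stop_domain` and the `false`/`true` stand-ins for `virsh shutdown`/`virsh destroy` are hypothetical — but it shows the pattern: when the forced destroy is the last command executed, its exit status becomes the action result, and the earlier graceful-shutdown error disappears.

```shell
# Hypothetical sketch of a graceful-then-forced stop sequence (NOT the
# VirtualDomain agent's real code): only the last command's exit status
# is returned, so a failed graceful shutdown is masked when the
# subsequent destroy succeeds.
stop_domain() {
  graceful=$1   # stand-in for "virsh shutdown <domain>"
  forced=$2     # stand-in for "virsh destroy <domain>"
  if "$graceful"; then
    return 0
  fi
  # Graceful path failed (cf. "Failed to shutdown domain '13' with
  # libxenlight" above); fall back to a forced destroy.
  "$forced"     # this rc becomes the function's (and action's) result
}

stop_domain false true
echo "stop rc: $?"    # prints "stop rc: 0" although the graceful stop errored
```

Whether that should count as a clean stop is exactly the question: the domain is gone in the end, so OCF-wise "success" is defensible, but the intermediate errors never reach the cluster.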
Fortunately, locking prevented duplicate activation on h18:
Feb 01 10:32:51 h18 systemd[1]: Started Virtualization daemon.
Feb 01 10:32:52 h18 virtlockd[9904]: Requested operation is not valid: Lockspace for path /var/lib/libvirt/lockd/files already exists
Feb 01 10:32:52 h18 virtlockd[9904]: Requested operation is not valid: Lockspace for path /var/lib/libvirt/lockd/lvmvolumes already exists
Feb 01 10:32:52 h18 virtlockd[9904]: Requested operation is not valid: Lockspace for path /var/lib/libvirt/lockd/scsivolumes already exists
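For anyone without this safety net: the protection comes from libvirt's virtlockd-based lock manager. A configuration sketch, assuming the libxl driver — the file names and the option names here are from my reading of the libvirt lockd documentation, so treat them as assumptions to verify against your installed config files:

```
# /etc/libvirt/libxl.conf -- enable the virtlockd-based lock manager
lock_manager = "lockd"

# /etc/libvirt/libxl-lockd.conf -- lockspace directories matching the
# /var/lib/libvirt/lockd/{files,lvmvolumes,scsivolumes} paths in the log
auto_disk_leases = 1
file_lockspace_dir = "/var/lib/libvirt/lockd/files"
lvm_lockspace_dir = "/var/lib/libvirt/lockd/lvmvolumes"
scsi_lockspace_dir = "/var/lib/libvirt/lockd/scsivolumes"
```

The lockspace directories must live on storage shared by all hosts for the leases to protect against cross-host double activation.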
So the main issue seems to be that a failed forced stop returned "success", causing a "Recover" on h18 while the VM was still running on h16.
h16:~ # rpm -qf /usr/lib/ocf/resource.d/heartbeat/VirtualDomain
resource-agents-4.4.0+git57.70549516-3.12.1.x86_64
(SLES15 SP2)
Regards,
Ulrich