[ClusterLabs] live migration rarely fails seemingly without reason

Lentes, Bernd bernd.lentes at helmholtz-muenchen.de
Mon Dec 3 10:40:12 EST 2018


Hi,

i have a two node cluster with several VirtualDomains as resources. Normally live migration is no problem. But rarely it fails, without giving any reasonable 
message in the logs. I tried to migrate several VirtualDmains concurrently from ha-idg-2 to ha-idg-1. One VirtualDomain failed, the others suceeded.

This happened already once before, i *think* it was also when i tried to migrate several VirtualDomains concurrently, but i'm not complete sure (it's already
some time ago).

But the error appears rarely, it often suceeded without any problems. I *think* it's related to migrating several domains concurrently.

node 1:
2018-12-03T16:03:03.037872+01:00 ha-idg-1 crmd[8615]:  warning: Action 44 (vm_mausdb_migrate_to_0) on ha-idg-2 failed (target: 0 vs. rc: 1): Error
2018-12-03T16:03:03.038234+01:00 ha-idg-1 crmd[8615]:   notice: Transition 44 aborted by operation vm_mausdb_migrate_to_0 'modify' on ha-idg-2: Event failed
2018-12-03T16:03:03.038569+01:00 ha-idg-1 crmd[8615]:  warning: Action 44 (vm_mausdb_migrate_to_0) on ha-idg-2 failed (target: 0 vs. rc: 1): Error

2018-12-03T16:03:03.041771+01:00 ha-idg-1 stonith-ng[8611]:  warning: fence_ha-idg-2 has 'action' parameter, which should never be specified in configuration
2018-12-03T16:03:03.042112+01:00 ha-idg-1 stonith-ng[8611]:  warning: Mapping action='Off' to pcmk_off_action='Off'
2018-12-03T16:03:03.042365+01:00 ha-idg-1 stonith-ng[8611]:  warning: Mapping action='Off' to pcmk_reboot_action='Off'
2018-12-03T16:03:03.208895+01:00 ha-idg-1 sshd[6397]: Received disconnect from 146.107.235.132 port 46004:11: disconnected by user
2018-12-03T16:03:03.209367+01:00 ha-idg-1 sshd[6397]: Disconnected from 146.107.235.132 port 46004
2018-12-03T16:03:03.209670+01:00 ha-idg-1 sshd[6397]: pam_unix(sshd:session): session closed for user root
2018-12-03T16:03:03.213996+01:00 ha-idg-1 systemd-logind[1627]: Removed session 17.
2018-12-03T16:03:03.286986+01:00 ha-idg-1 crmd[8615]:   notice: Transition 44 (Complete=6, Pending=0, Fired=0, Skipped=1, Incomplete=9, Source=/var/lib/pacemaker/pengine/pe-input-346.bz2): Stopped
2018-12-03T16:03:03.290762+01:00 ha-idg-1 stonith-ng[8611]:  warning: fence_ha-idg-2 has 'action' parameter, which should never be specified in configuration
2018-12-03T16:03:03.291224+01:00 ha-idg-1 stonith-ng[8611]:  warning: Mapping action='Off' to pcmk_off_action='Off'
2018-12-03T16:03:03.291619+01:00 ha-idg-1 stonith-ng[8611]:  warning: Mapping action='Off' to pcmk_reboot_action='Off'
2018-12-03T16:03:03.313864+01:00 ha-idg-1 pengine[8614]:  warning: Processing failed op migrate_to for vm_mausdb on ha-idg-2: unknown error (1)
2018-12-03T16:03:03.314193+01:00 ha-idg-1 pengine[8614]:  warning: Processing failed op migrate_to for vm_mausdb on ha-idg-2: unknown error (1)
2018-12-03T16:03:03.316485+01:00 ha-idg-1 pengine[8614]:   notice: Recover vm_mausdb#011(Started ha-idg-2 -> ha-idg-1)
2018-12-03T16:03:03.316805+01:00 ha-idg-1 pengine[8614]:   notice: Migrate vm_geneious#011(Started ha-idg-2 -> ha-idg-1)
2018-12-03T16:03:03.317914+01:00 ha-idg-1 pengine[8614]:   notice: Calculated transition 45, saving inputs in /var/lib/pacemaker/pengine/pe-input-347.bz2

node2:
2018-12-03T16:02:32.312598+01:00 ha-idg-2 VirtualDomain(vm_mausdb)[7988]: INFO: mausdb: Starting live migration to ha-idg-1 (using: virsh --connect=qemu:///system --quiet migrate --live  ma
usdb qemu+ssh://ha-idg-1/system ).
2018-12-03T16:02:32.315358+01:00 ha-idg-2 VirtualDomain(vm_geneious)[7989]: INFO: geneious: Starting live migration to ha-idg-1 (using: virsh --connect=qemu:///system --quiet migrate --live
  geneious qemu+ssh://ha-idg-1/system ).
2018-12-03T16:02:32.391911+01:00 ha-idg-2 lrmd[14256]:   notice: executing - rsc:vm_sim action:stop call_id:78
2018-12-03T16:02:32.394065+01:00 ha-idg-2 stonith-ng[14255]:  warning: fence_ha-idg-1 has 'action' parameter, which should never be specified in configuration
2018-12-03T16:02:32.394369+01:00 ha-idg-2 stonith-ng[14255]:  warning: Mapping action='Off' to pcmk_off_action='Off'
2018-12-03T16:02:32.394619+01:00 ha-idg-2 stonith-ng[14255]:  warning: Mapping action='Off' to pcmk_reboot_action='Off'
2018-12-03T16:02:32.459622+01:00 ha-idg-2 VirtualDomain(vm_sim)[8599]: INFO: Domain sim already stopped.
2018-12-03T16:02:32.464042+01:00 ha-idg-2 lrmd[14256]:   notice: finished - rsc:vm_sim action:stop call_id:78 pid:8599 exit-code:0 exec-time:72ms queue-time:0ms
2018-12-03T16:02:32.464750+01:00 ha-idg-2 crmd[14259]:   notice: Result of stop operation for vm_sim on ha-idg-2: 0 (ok)
2018-12-03T16:02:32.470496+01:00 ha-idg-2 stonith-ng[14255]:  warning: fence_ha-idg-1 has 'action' parameter, which should never be specified in configuration
2018-12-03T16:02:32.470794+01:00 ha-idg-2 stonith-ng[14255]:  warning: Mapping action='Off' to pcmk_off_action='Off'
2018-12-03T16:02:32.471043+01:00 ha-idg-2 stonith-ng[14255]:  warning: Mapping action='Off' to pcmk_reboot_action='Off'
2018-12-03T16:02:32.628562+01:00 ha-idg-2 stonith-ng[14255]:  warning: fence_ha-idg-1 has 'action' parameter, which should never be specified in configuration
2018-12-03T16:02:32.628970+01:00 ha-idg-2 stonith-ng[14255]:  warning: Mapping action='Off' to pcmk_off_action='Off'
2018-12-03T16:02:32.629226+01:00 ha-idg-2 stonith-ng[14255]:  warning: Mapping action='Off' to pcmk_reboot_action='Off'
2018-12-03T16:03:02.836145+01:00 ha-idg-2 libvirtd[3117]: 2018-12-03 15:03:02.835+0000: 4515: error : qemuMigrationCheckJobStatus:1456 : operation failed: migration job: unexpectedly failed
2018-12-03T16:03:03.006918+01:00 ha-idg-2 VirtualDomain(vm_mausdb)[7988]: ERROR: mausdb: live migration to ha-idg-1 failed: 1
2018-12-03T16:03:03.032370+01:00 ha-idg-2 kernel: [ 6593.265436] br0: port 2(vnet0) entered disabled state
2018-12-03T16:03:03.032383+01:00 ha-idg-2 kernel: [ 6593.266268] device vnet0 left promiscuous mode
2018-12-03T16:03:03.032385+01:00 ha-idg-2 kernel: [ 6593.266275] br0: port 2(vnet0) entered disabled state
2018-12-03T16:03:03.032789+01:00 ha-idg-2 lrmd[14256]:   notice: vm_mausdb_migrate_to_0:7988:stderr [ error: operation failed: migration job: unexpectedly failed ]
2018-12-03T16:03:03.033126+01:00 ha-idg-2 lrmd[14256]:   notice: vm_mausdb_migrate_to_0:7988:stderr [ ocf-exit-reason:mausdb: live migration to ha-idg-1 failed: 1 ]
2018-12-03T16:03:03.033376+01:00 ha-idg-2 lrmd[14256]:   notice: finished - rsc:vm_mausdb action:migrate_to call_id:75 pid:7988 exit-code:1 exec-time:31065ms queue-time:0ms
2018-12-03T16:03:03.033909+01:00 ha-idg-2 crmd[14259]:   notice: Result of migrate_to operation for vm_mausdb on ha-idg-2: 1 (unknown error)
2018-12-03T16:03:03.034198+01:00 ha-idg-2 crmd[14259]:   notice: ha-idg-2-vm_mausdb_migrate_to_0:75 [ error: operation failed: migration job: unexpectedly failed\nocf-exit-reason:mausdb: live migration to ha-idg-1 failed: 1\n ]

/var/log/pacemaker.log:
Dec 03 16:02:32 [8611] ha-idg-1 stonith-ng:     info: cib_devices_update:       Updating devices to version 1.789.6
Dec 03 16:02:32 [8611] ha-idg-1 stonith-ng:  warning: xml2device_params:        fence_ha-idg-2 has 'action' parameter, which should never be specified in configuration
Dec 03 16:02:32 [8611] ha-idg-1 stonith-ng:  warning: map_action:       Mapping action='Off' to pcmk_off_action='Off'
Dec 03 16:02:32 [8611] ha-idg-1 stonith-ng:  warning: map_action:       Mapping action='Off' to pcmk_reboot_action='Off'
Dec 03 16:02:32 [8611] ha-idg-1 stonith-ng:     info: cib_device_update:        Device fence_ha-idg-1 has been disabled on ha-idg-1: score=-INFINITY
Dec 03 16:02:37 [8610] ha-idg-1        cib:     info: cib_process_ping: Reporting our current digest to ha-idg-1: 2715b3a8fd056fbd4d84cb12b911a328 for 1.789.6 (0x1aebf00 0)
Dec 03 16:03:03 [8610] ha-idg-1        cib:     info: cib_perform_op:   Diff: --- 1.789.6 2
Dec 03 16:03:03 [8610] ha-idg-1        cib:     info: cib_perform_op:   Diff: +++ 1.789.7 (null)
Dec 03 16:03:03 [8610] ha-idg-1        cib:     info: cib_perform_op:   +  /cib:  @num_updates=7
Dec 03 16:03:03 [8610] ha-idg-1        cib:     info: cib_perform_op:   +  /cib/status/node_state[@id='1084777492']/lrm[@id='1084777492']/lrm_resources/lrm_resource[@id='vm_mausdb']/lrm_rsc
_op[@id='vm_mausdb_last_0']:  @transition-magic=0:1;44:44:0:e4b76be7-fdb6-47c1-a905-9a49650b4180, @exit-reason=mausdb: live migration to ha-idg-1 failed: 1, @call-id=75, @rc-code=1, @op-sta
tus=0, @exec-time=31065
Dec 03 16:03:03 [8610] ha-idg-1        cib:     info: cib_perform_op:   ++ /cib/status/node_state[@id='1084777492']/lrm[@id='1084777492']/lrm_resources/lrm_resource[@id='vm_mausdb']:  <lrm_
rsc_op id="vm_mausdb_last_failure_0" operation_key="vm_mausdb_migrate_to_0" operation="migrate_to" crm-debug-origin="do_update_resource" crm_feature_set="3.0.13" transition-key="44:44:0:e4b
76be7-fdb6-47c1-a905-9a49650b4180" transition-magic="0:1;44:44:0:e4b76be7-fdb6-47c1-a905-9a49650b4180" exit-reason="mausdb: live migrat
Dec 03 16:03:03 [8610] ha-idg-1        cib:     info: cib_process_request:      Completed cib_modify operation for section status: OK (rc=0, origin=ha-idg-2/crmd/74, version=1.789.7)
Dec 03 16:03:03 [8615] ha-idg-1       crmd:  warning: status_from_rc:   Action 44 (vm_mausdb_migrate_to_0) on ha-idg-2 failed (target: 0 vs. rc: 1): Error
Dec 03 16:03:03 [8615] ha-idg-1       crmd:   notice: abort_transition_graph:   Transition 44 aborted by operation vm_mausdb_migrate_to_0 'modify' on ha-idg-2: Event failed | magic=0:1;44:4
4:0:e4b76be7-fdb6-47c1-a905-9a49650b4180 cib=1.789.7 source=match_graph_event:310 complete=false
Dec 03 16:03:03 [8615] ha-idg-1       crmd:     info: match_graph_event:        Action vm_mausdb_migrate_to_0 (44) confirmed on ha-idg-2 (rc=1)
Dec 03 16:03:03 [8615] ha-idg-1       crmd:     info: process_graph_event:      Detected action (44.44) vm_mausdb_migrate_to_0.75=unknown error: failed
Dec 03 16:03:03 [8615] ha-idg-1       crmd:  warning: status_from_rc:   Action 44 (vm_mausdb_migrate_to_0) on ha-idg-2 failed (target: 0 vs. rc: 1): Error
Dec 03 16:03:03 [8615] ha-idg-1       crmd:     info: abort_transition_graph:   Transition 44 aborted by operation vm_mausdb_migrate_to_0 'create' on ha-idg-2: Event failed | magic=0:1;44:4
4:0:e4b76be7-fdb6-47c1-a905-9a49650b4180 cib=1.789.7 source=match_graph_event:310 complete=false
Dec 03 16:03:03 [8615] ha-idg-1       crmd:     info: match_graph_event:        Action vm_mausdb_migrate_to_0 (44) confirmed on ha-idg-2 (rc=1)
Dec 03 16:03:03 [8615] ha-idg-1       crmd:     info: process_graph_event:      Detected action (44.44) vm_mausdb_migrate_to_0.75=unknown error: failed
Dec 03 16:03:03 [8611] ha-idg-1 stonith-ng:     info: update_cib_stonith_devices_v2:    Updating device list from the cib: modify lrm_rsc_op[@id='vm_mausdb_last_0']

Hosts are SLES 12 SP3, pacenaker is the most recent for my OS: pacemaker-1.1.16-6.5.1.x86_64

I don't have any clue, i'm thankful for any hint. I can provide you with further information if needed.


Bernd
-- 

Bernd Lentes 
Systemadministration 
Institut für Entwicklungsgenetik 
Gebäude 35.34 - Raum 208 
HelmholtzZentrum münchen 
[ mailto:bernd.lentes at helmholtz-muenchen.de | bernd.lentes at helmholtz-muenchen.de ] 
phone: +49 89 3187 1241 
fax: +49 89 3187 2294 
[ http://www.helmholtz-muenchen.de/idg | http://www.helmholtz-muenchen.de/idg ] 

wer Fehler macht kann etwas lernen 
wer nichts macht kann auch nichts lernen
 

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDirig.in Petra Steiner-Hoffmann
Stellv.Aufsichtsratsvorsitzender: MinDirig. Dr. Manfred Wolter
Geschaeftsfuehrer: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Heinrich Bassler, Dr. rer. nat. Alfons Enhsen
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671



More information about the Users mailing list