[ClusterLabs] Salvaging aborted resource migration

Thu Sep 27 02:37:47 EDT 2018

Hi,

Here is the current behavior of a cancelled migration under Pacemaker
1.1.16, with a resource that implements push migration:

# /usr/sbin/crm_resource --ban -r vm-conv-4

vhbl03 crmd[10017]:   notice: State transition S_IDLE -> S_POLICY_ENGINE
vhbl03 pengine[10016]:   notice: Migrate vm-conv-4#011(Started vhbl07 -> vhbl04)
vhbl03 crmd[10017]:   notice: Initiating migrate_to operation vm-conv-4_migrate_to_0 on vhbl07
vhbl03 pengine[10016]:   notice: Calculated transition 4633, saving inputs in /var/lib/pacemaker/pengine/pe-input-1069.bz2
[...]

At this point, with the migration still ongoing, I wanted to get rid of
the constraint:

# /usr/sbin/crm_resource --clear -r vm-conv-4

vhbl03 crmd[10017]:   notice: Transition aborted by deletion of rsc_location[@id='cli-ban-vm-conv-4-on-vhbl07']: Configuration change
vhbl07 crmd[10233]:   notice: Result of migrate_to operation for vm-conv-4 on vhbl07: 0 (ok)
vhbl03 crmd[10017]:   notice: Transition 4633 (Complete=6, Pending=0, Fired=0, Skipped=1, Incomplete=6, Source=/var/lib/pacemaker/pengine/pe-input-1069.bz2): Stopped
vhbl03 pengine[10016]:   notice: Resource vm-conv-4 can no longer migrate to vhbl04. Stopping on vhbl07 too
vhbl03 pengine[10016]:   notice: Reload  vm-conv-4#011(Started vhbl07)
vhbl03 pengine[10016]:   notice: Calculated transition 4634, saving inputs in /var/lib/pacemaker/pengine/pe-input-1070.bz2
vhbl03 crmd[10017]:   notice: Initiating stop operation vm-conv-4_stop_0 on vhbl07
vhbl03 crmd[10017]:   notice: Initiating stop operation vm-conv-4_stop_0 on vhbl04
vhbl03 crmd[10017]:   notice: Initiating reload operation vm-conv-4_reload_0 on vhbl04

This recovery was entirely unnecessary, as the resource successfully
migrated to vhbl04 (the migrate_from operation does nothing).  Pacemaker
does not know this, but is there a way to educate it?  I think in this
special case it is possible to redesign the agent so that migrate_to is
a no-op and migrate_from does everything, which would significantly
reduce the window between the start points of the two "halves", but I'm
not sure that would help in the end: Pacemaker could still decide to do
an unnecessary stop+start recovery.  Would it?  I failed to find any
documentation on recovery from aborted migration transitions.  I don't
expect on-fail (for migrate_* ops, not me) to apply here, does it?
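
To make the redesign idea concrete, here is a rough sketch of the two
migration actions in such an agent.  It is an illustration only, not a
complete OCF agent: the function names (agent_migrate_to,
agent_migrate_from, agent_dispatch, pull_vm_from) are made up, and
pull_vm_from merely stands in for the real transfer.  I am assuming the
source node name arrives in OCF_RESKEY_CRM_meta_migrate_source, as it
does in the agents I have looked at.

```shell
#!/bin/sh
# Hypothetical skeleton: only the migration actions are shown.

OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_ERR_UNIMPLEMENTED=3

pull_vm_from() {
    # Placeholder (assumption): a real agent would start the actual
    # transfer here, contacting the source node named in $1.
    echo "pulling from $1"
}

agent_migrate_to() {
    # Source side: deliberately a no-op.
    return $OCF_SUCCESS
}

agent_migrate_from() {
    # Target side: does all the work, using the source node name
    # taken from the environment (assumed variable, see above).
    src="${OCF_RESKEY_CRM_meta_migrate_source:-}"
    [ -n "$src" ] || return $OCF_ERR_GENERIC
    pull_vm_from "$src" || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

agent_dispatch() {
    # Map the requested action to its handler; everything else is
    # reported as unimplemented in this sketch.
    case "$1" in
        migrate_to)   agent_migrate_to ;;
        migrate_from) agent_migrate_from ;;
        *)            return $OCF_ERR_UNIMPLEMENTED ;;
    esac
}
```

A real agent would end with something like `agent_dispatch "$1"; exit $?`
and implement start, stop, monitor and meta-data as well.  Whether this
layout changes what Pacemaker does after an aborted transition is exactly
the open question.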

Side question: why is a reload initiated at all, as in the log above?

An even more tangential question: could you please consider using
spaces instead of TABs in syslog messages?  (Actually, I wouldn't mind
getting rid of TABs altogether in any output.)
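
For reference, the TABs show up in the excerpts above as "#011", which
I take to be the syslog daemon's escape for the TAB character (octal
011).  Until the messages change, a small filter along these lines can
clean up log excerpts.  The function name untab_syslog is made up, and
the sketch assumes the escape always appears literally as "#011":

```shell
# Replace the literal "#011" escape (octal 011 = TAB) with one space.
# Reads log lines on stdin, writes cleaned lines to stdout.
untab_syslog() {
    sed 's/#011/ /g'
}
```

Usage would be something like `grep pengine /var/log/syslog | untab_syslog`,
with the log path adjusted to the local setup.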
-- 
Thanks,
Feri