[ClusterLabs] Antw: Salvaging aborted resource migration

Thu Sep 27 14:31:32 UTC 2018

On Thu, 2018-09-27 at 09:36 +0200, Ulrich Windl wrote:
> Hi!
> 
> Obviously you violated the most important cluster rule, which is "be
> patient". Maybe the next most important is "Don't change the
> configuration while the cluster is not in IDLE state" ;-)

Agreed -- although even when the cluster is idle, removing a ban can
result in a migration back (if something like stickiness doesn't prevent
it).
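
A rough sketch of how one might follow the "not while a transition is
running" rule from the command line. It assumes the installed
crm_resource supports --wait (block until the cluster has no pending
actions); the function name clear_ban_when_settled is made up for
illustration. Waiting does not stop the cluster from moving the resource
back afterwards; only stickiness or a similar preference does that.

```shell
# Sketch only: clear a ban after the cluster has settled.
# Assumption: crm_resource --wait is available in the installed version.
clear_ban_when_settled() {
    rsc="$1"
    # Block until the current transition has finished; give up if waiting fails.
    crm_resource --wait || return 1
    # Only now remove the constraint created earlier by "crm_resource --ban".
    crm_resource --clear -r "$rsc"
}

# Usage on a cluster node:
#   clear_ban_when_settled vm-conv-4
```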

There's currently no way to tell pacemaker that an operation (here,
migrate_from) is a no-op and can be ignored. If a migration is only
partially completed, it has to be considered a failure and reverted.

I'm not sure why the reload was scheduled; I suspect it's a bug due to
a restart being needed but no parameters having changed. There should
be special handling for a partial migration to make the stop required.
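
To see what the scheduler decided for the second transition, the saved
input named in the log can be replayed offline. A minimal sketch,
assuming crm_simulate is installed and the pe-input file is still on
disk; replay_transition is a hypothetical helper name, not part of
Pacemaker.

```shell
# Sketch only: replay a saved scheduler input and print the resulting actions.
replay_transition() {
    input="$1"
    # --simulate runs the calculated transition; --xml-file reads the saved input.
    crm_simulate --simulate --xml-file "$input"
}

# Usage, with the file named in the transition 4634 log line:
#   replay_transition /var/lib/pacemaker/pengine/pe-input-1070.bz2
```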

> I feel these are issues that should be fixed, but the above rules make
> your life easier while these issues still exist.
> 
> Regards,
> Ulrich
> 
> Ferenc Wágner <wagner.ferenc at kifu.gov.hu> wrote on 27.09.2018 at 08:37
> in message <87tvmb5ttw.fsf at lant.ki.iif.hu>:
> > Hi,
> > 
> > The current behavior of cancelled migration with Pacemaker 1.1.16 with
> > a resource implementing push migration:
> > 
> > # /usr/sbin/crm_resource --ban -r vm-conv-4
> > 
> > vhbl03 crmd[10017]:   notice: State transition S_IDLE -> S_POLICY_ENGINE
> > vhbl03 pengine[10016]:   notice: Migrate vm-conv-4#011(Started vhbl07 -> vhbl04)
> > vhbl03 crmd[10017]:   notice: Initiating migrate_to operation vm-conv-4_migrate_to_0 on vhbl07
> > vhbl03 pengine[10016]:   notice: Calculated transition 4633, saving inputs in /var/lib/pacemaker/pengine/pe-input-1069.bz2
> > [...]
> > 
> > At this point, with the migration still ongoing, I wanted to get rid of
> > the constraint:
> > 
> > # /usr/sbin/crm_resource --clear -r vm-conv-4
> > 
> > vhbl03 crmd[10017]:   notice: Transition aborted by deletion of rsc_location[@id='cli-ban-vm-conv-4-on-vhbl07']: Configuration change
> > vhbl07 crmd[10233]:   notice: Result of migrate_to operation for vm-conv-4 on vhbl07: 0 (ok)
> > vhbl03 crmd[10017]:   notice: Transition 4633 (Complete=6, Pending=0, Fired=0, Skipped=1, Incomplete=6, Source=/var/lib/pacemaker/pengine/pe-input-1069.bz2): Stopped
> > vhbl03 pengine[10016]:   notice: Resource vm-conv-4 can no longer migrate to vhbl04. Stopping on vhbl07 too
> > vhbl03 pengine[10016]:   notice: Reload  vm-conv-4#011(Started vhbl07)
> > vhbl03 pengine[10016]:   notice: Calculated transition 4634, saving inputs in /var/lib/pacemaker/pengine/pe-input-1070.bz2
> > vhbl03 crmd[10017]:   notice: Initiating stop operation vm-conv-4_stop_0 on vhbl07
> > vhbl03 crmd[10017]:   notice: Initiating stop operation vm-conv-4_stop_0 on vhbl04
> > vhbl03 crmd[10017]:   notice: Initiating reload operation vm-conv-4_reload_0 on vhbl04
> > 
> > This recovery was entirely unnecessary, as the resource successfully
> > migrated to vhbl04 (the migrate_from operation does nothing).  Pacemaker
> > does not know this, but is there a way to educate it?  I think in this
> > special case it is possible to redesign the agent, making migrate_to a
> > no-op and doing everything in migrate_from, which would significantly
> > reduce the window between the start points of the two "halves", but I'm
> > not sure that would help in the end: Pacemaker could still decide to do
> > an unnecessary stop+start recovery.  Would it?  I failed to find any
> > documentation on recovery from aborted migration transitions.  I don't
> > expect on-fail (for migrate_* ops, not me) to apply here, does it?
> > 
> > Side question: why initiate a reload in any case, like above?
> > 
> > Even more side question: could you please consider using space instead
> > of TAB in syslog messages?  (Actually, I wouldn't mind getting rid of
> > them altogether in any output.)
> > -- 
> > Thanks,
> > Feri
> > _______________________________________________
> > Users mailing list: Users at clusterlabs.org 
> > https://lists.clusterlabs.org/mailman/listinfo/users 
> > 
> > Project Home: http://www.clusterlabs.org 
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org 
-- 
Ken Gaillot <kgaillot at redhat.com>