[ClusterLabs] Antw: Salvaging aborted resource migration

Thu Sep 27 16:00:42 UTC 2018

Ken Gaillot <kgaillot at redhat.com> writes:

> On Thu, 2018-09-27 at 09:36 +0200, Ulrich Windl wrote:
> 
>> Obviously you violated the most important cluster rule that is "be
>> patient".  Maybe the next important is "Don't change the
>> configuration while the cluster is not in IDLE state" ;-)
>
> Agreed -- although even idle, removing a ban can result in a migration
> back (if something like stickiness doesn't prevent it).

I've got no problem with that in general.  However, I can't guarantee
that every configuration change happens in idle state: certain
operations (mostly resource additions) are done by several
administrators without synchronization, and of course asynchronous
cluster events can also happen at any time.  So I have to ask: what are
the consequences of breaking this "impossible" rule?

> There's currently no way to tell pacemaker that an operation (i.e.
> migrate_from) is a no-op and can be ignored. If a migration is only
> partially completed, it has to be considered a failure and reverted.

OK.  Are there other complex operations which can "partially complete"
if a transition is aborted by some event?

Now let's suppose a pull migration scenario: migrate_to does nothing,
but in the tiny window right after it a configuration change aborts the
transition.  The resources would go through a full recovery
(stop+start), right?

Next, suppose migrate_from gets scheduled and starts performing the
migration.  Before it finishes, a configuration change aborts the
transition.  The cluster waits for the outstanding operation to finish,
doesn't it?  And if it finishes successfully, is the migration
considered complete, requiring no recovery?
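
To make the two scenarios concrete, here is a toy Python sketch of the
rule as quoted above (a partially completed migration counts as a
failure and is reverted).  It is only my reading of that rule, not
Pacemaker's scheduler logic; the function name, the enum and the boolean
flags are all hypothetical.  Whether the second scenario really ends up
in the "complete" branch is exactly what I'm asking.

```python
from enum import Enum


class Outcome(Enum):
    NOT_STARTED = "nothing ran, nothing to recover"
    COMPLETE = "both halves succeeded, no recovery"
    PARTIAL = "treated as failure, full recovery (stop+start)"


def migration_outcome(migrate_to_ok: bool, migrate_from_ok: bool) -> Outcome:
    """Classify a live migration by which of its two operations succeeded.

    Assumption: a migration counts as complete only when both migrate_to
    and migrate_from have succeeded; anything in between is partial.
    """
    if migrate_to_ok and migrate_from_ok:
        return Outcome.COMPLETE
    if migrate_to_ok or migrate_from_ok:
        return Outcome.PARTIAL
    return Outcome.NOT_STARTED


# Scenario 1: migrate_to (a no-op here) is done, transition aborted
# before migrate_from ran.
scenario_1 = migration_outcome(migrate_to_ok=True, migrate_from_ok=False)

# Scenario 2: transition aborted while migrate_from was in flight, but
# the outstanding operation then finished successfully.
scenario_2 = migration_outcome(migrate_to_ok=True, migrate_from_ok=True)

print(scenario_1.name, "/", scenario_2.name)
```

Under that assumption the first scenario is classified as partial and
the second as complete.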

> I'm not sure why the reload was scheduled; I suspect it's a bug due to
> a restart being needed but no parameters having changed. There should
> be special handling for a partial migration to make the stop required.

Probably CLBZ#5309 again...  You debugged a pe-input file for me with a
similar issue almost exactly a year ago (thread subject "Pacemaker
resource parameter reload confusion").  Time to upgrade this cluster, I
guess.
-- 
Thanks,
Feri