[ClusterLabs] Stopping all nodes causes servers to migrate

Mon Jan 25 11:01:39 EST 2021

On Mon, 2021-01-25 at 09:51 +0100, Jehan-Guillaume de Rorthais wrote:
> Hi Digimer,
> 
> On Sun, 24 Jan 2021 15:31:22 -0500
> Digimer <lists at alteeve.ca> wrote:
> [...]
> >   I had a test server (srv01-test) running on node 1 (el8-a01n01),
> > and on node 2 (el8-a01n02) I ran 'pcs cluster stop --all'.
> >
> >   It appears like pacemaker asked the VM to migrate to node 2
> > instead of stopping it. Once the server was on node 2, I couldn't
> > use 'pcs resource disable <vm>' as it returned that that resource
> > was unmanaged, and the cluster shut down was hung. When I directly
> > stopped the VM and then did a 'pcs resource cleanup', the cluster
> > shutdown completed.
> 
> As actions during a cluster shutdown cannot be handled in the same
> transition for each node, I usually add a step to disable all
> resources using the "stop-all-resources" property before shutting
> down the cluster:
> 
>   pcs property set stop-all-resources=true
>   pcs cluster stop --all
> 
> But it seems there's a very new cluster property to handle that
> (IIRC, one or two releases ago). Look at the "shutdown-lock" doc:
> 
>   [...]
>   some users prefer to make resources highly available only for
>   failures, with no recovery for clean shutdowns. If this option is
>   true, resources active on a node when it is cleanly shut down are
>   kept "locked" to that node (not allowed to run elsewhere) until
>   they start again on that node after it rejoins (or for at most
>   shutdown-lock-limit, if set).
>   [...]
> 
> [...]
> >   So as best as I can tell, pacemaker really did ask for a
> > migration. Is this the case?
> 
> AFAIK, yes, because each cluster shutdown request is handled
> independently at node level. There's a large door open for all kinds
> of race conditions if requests are handled with some random lag on
> each node.

I'm going to guess that's what happened.

The basic issue is that there is no "cluster shutdown" in Pacemaker,
only "node shutdown". I'm guessing "pcs cluster stop --all" sends
shutdown requests for each node in sequence (probably via systemd), and
if the nodes are quick enough, one could start migrating off resources
before all the others get their shutdown request.

There would be a way around it. Normally Pacemaker is shut down via
SIGTERM to pacemakerd (which is what systemctl stop does), but inside
Pacemaker it's implemented as a special "shutdown" transient node
attribute, set to the epoch timestamp of the request. It would be
possible to set that attribute for all nodes in a copy of the CIB, then
load that into the live cluster.
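
As a rough, untested sketch of that idea: the function below only prints
the commands it would run, so they can be reviewed before being piped to
sh. The file path and the function name are made up for illustration.
It assumes that crm_attribute edits the file named by CIB_file instead
of going through the live cluster, and that the cluster acts on a
"shutdown" attribute loaded this way; neither has been verified here.

```shell
#!/bin/sh
# Hypothetical helper: print a plan that marks every listed node as
# shutting down in a CIB copy, then loads the copy in a single step.
# Usage: shutdown_plan <cib-copy-path> <node>...
shutdown_plan() {
    cib_copy=$1
    shift
    # One timestamp for all nodes, as the attribute holds the epoch
    # time of the shutdown request.
    now=$(date +%s)

    # Take a copy of the live CIB.
    echo "cibadmin --query > $cib_copy"

    # Set the transient "shutdown" attribute for each node in the copy
    # (assumption: CIB_file makes crm_attribute edit the file only).
    for node in "$@"; do
        echo "CIB_file=$cib_copy crm_attribute --node $node --name shutdown --update $now --lifetime reboot"
    done

    # Load the modified copy so every node sees its request together.
    echo "cibadmin --replace --xml-file $cib_copy"
}

# Example with the node names from this thread; prints four commands.
shutdown_plan /tmp/shutdown-cib.xml el8-a01n01 el8-a01n02
```

Printing rather than executing is deliberate: replacing the live CIB is
not something to do blindly, so the plan is meant to be read first.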

stop-all-resources as suggested would be another way around it (and
would have to be cleared after start-up, which could be a plus or a
minus depending on how much control vs convenience you want).
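
For completeness, a minimal sketch of the full cycle with
stop-all-resources, including the clean-up after start-up. The function
name is made up, and it only prints the commands so they can be reviewed
or piped to sh. The crm_resource --wait step is an addition of mine, on
the assumption that you want the resources fully stopped before the
nodes get their shutdown requests.

```shell
#!/bin/sh
# Hypothetical helper: print the commands for one phase of a full
# cluster stop/start cycle using the stop-all-resources property.
# Usage: stop_all_cycle stop|start
stop_all_cycle() {
    case "$1" in
        stop)
            # Stop every resource first so nothing is left to migrate.
            echo "pcs property set stop-all-resources=true"
            # Assumption: wait for the cluster to settle before stopping.
            echo "crm_resource --wait"
            echo "pcs cluster stop --all"
            ;;
        start)
            echo "pcs cluster start --all"
            # The property persists in the CIB, so clear it afterwards
            # or the resources will stay stopped.
            echo "pcs property set stop-all-resources=false"
            ;;
        *)
            echo "usage: stop_all_cycle stop|start" >&2
            return 1
            ;;
    esac
}

# Example: show the shutdown half of the cycle.
stop_all_cycle stop
```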
-- 
Ken Gaillot <kgaillot at redhat.com>