[ClusterLabs] Stopping all nodes causes servers to migrate

Tue Jan 26 11:24:44 EST 2021

On Tue, 26 Jan 2021 16:15:55 +0100
Tomas Jelinek <tojeline at redhat.com> wrote:

> On 25. 01. 21 at 17:01, Ken Gaillot wrote:
> > On Mon, 2021-01-25 at 09:51 +0100, Jehan-Guillaume de Rorthais wrote:
> >> Hi Digimer,
> >>
> >> On Sun, 24 Jan 2021 15:31:22 -0500
> >> Digimer <lists at alteeve.ca> wrote:
> >> [...]
> >>>   I had a test server (srv01-test) running on node 1 (el8-a01n01), and on
> >>> node 2 (el8-a01n02) I ran 'pcs cluster stop --all'.
> >>>
> >>>    It appears that pacemaker asked the VM to migrate to node 2 instead of
> >>> stopping it. Once the server was on node 2, I couldn't use 'pcs resource
> >>> disable <vm>' as it reported that the resource was unmanaged, and the
> >>> cluster shutdown hung. When I directly stopped the VM and then did
> >>> a 'pcs resource cleanup', the cluster shutdown completed.
> >>
> >> As actions during a cluster shutdown cannot be handled in the same
> >> transition for every node, I usually add a step to disable all resources
> >> using the "stop-all-resources" property before shutting down the cluster:
> >>
> >>    pcs property set stop-all-resources=true
> >>    pcs cluster stop --all
> >>
> >> But it seems there's a very new cluster property to handle that (IIRC,
> >> one or two releases ago). Look at "shutdown-lock" doc:
> >>
> >>    [...]
> >>    some users prefer to make resources highly available only for failures,
> >>    with no recovery for clean shutdowns. If this option is true, resources
> >>    active on a node when it is cleanly shut down are kept "locked" to that
> >>    node (not allowed to run elsewhere) until they start again on that node
> >>    after it rejoins (or for at most shutdown-lock-limit, if set).
> >>    [...]
> >>
> >> [...]
> >>>    So as best as I can tell, pacemaker really did ask for a migration. Is
> >>> this the case?
> >>
> >> AFAIK, yes, because each cluster shutdown request is handled
> >> independently at node level. That leaves the door wide open for all kinds
> >> of race conditions if requests are handled with random lag on each node.
> > 
> > I'm going to guess that's what happened.
> > 
> > The basic issue is that there is no "cluster shutdown" in Pacemaker,
> > only "node shutdown". I'm guessing "pcs cluster stop --all" sends
> > shutdown requests for each node in sequence (probably via systemd), and
> > if the nodes are quick enough, one could start migrating off resources
> > before all the others get their shutdown request.
> 
> Pcs is doing its best to stop nodes in parallel. The first
> implementation of this was done back in 2015:
> https://bugzilla.redhat.com/show_bug.cgi?id=1180506
> Since then, we moved to using curl for network communication, which also
> handles parallel cluster stop. Obviously, this doesn't ensure the stop
> command arrives at and is processed on all nodes at exactly the same time.
> 
> Basically, pcs sends a 'stop pacemaker' request to all nodes in parallel
> and waits for it to finish on all nodes. Then it sends a 'stop corosync'
> request to all nodes in parallel.

How about adding a step to set/remove "stop-all-resources" on cluster
shutdown/start? This step could either be optional, enabled by a new CLI
argument, or added automatically when --all is given for these commands.
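
For illustration, the proposed sequence could be sketched as a small shell
wrapper around the commands already mentioned in this thread. This is only a
sketch under assumptions, not how pcs behaves today: the function names and
the PCS override variable are hypothetical, --wait is used on start so the
cluster is up before the property is changed, and setting the property back
to false (rather than removing it) is just one possible choice.

```shell
#!/bin/sh
# Sketch only: wraps the manual two-step sequence discussed in this thread.
# PCS is a hypothetical override so the sequence can be dry-run (PCS=echo).
PCS="${PCS:-pcs}"

# Ask the cluster to stop every resource first, then stop all nodes.
# With the resources already stopped, a node that shuts down late has
# nothing left to migrate.
cluster_stop_all() {
    "$PCS" property set stop-all-resources=true || return 1
    "$PCS" cluster stop --all
}

# Start all nodes, wait for them to come up, then let resources run again.
cluster_start_all() {
    "$PCS" cluster start --all --wait || return 1
    "$PCS" property set stop-all-resources=false
}
```

After sourcing the file, calling cluster_stop_all from any node would replace
the two manual commands; with PCS=echo the functions only print what they
would run.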

Thoughts?