[ClusterLabs] Antw: [EXT] Re: Stopping all nodes causes servers to migrate

Ken Gaillot kgaillot at redhat.com
Wed Jan 27 12:46:37 EST 2021


On Wed, 2021-01-27 at 08:35 +0100, Ulrich Windl wrote:
> >>> Tomas Jelinek <tojeline at redhat.com> wrote on 26.01.2021 at 16:15
> in message <48f935a5-184f-d2d7-7f1a-db596aa6c72c at redhat.com>:
> > On 25. 01. 21 at 17:01, Ken Gaillot wrote:
> > > On Mon, 2021-01-25 at 09:51 +0100, Jehan-Guillaume de Rorthais
> > > wrote:
> > > > Hi Digimer,
> > > > 
> > > > On Sun, 24 Jan 2021 15:31:22 -0500
> > > > Digimer <lists at alteeve.ca> wrote:
> > > > [...]
> > > > >   I had a test server (srv01-test) running on node 1
> > > > > (el8-a01n01), and on node 2 (el8-a01n02) I ran
> > > > > 'pcs cluster stop --all'.
> > > > > 
> > > > >    It appears like pacemaker asked the VM to migrate to node 2
> > > > > instead of stopping it. Once the server was on node 2, I
> > > > > couldn't use 'pcs resource disable <vm>' as it returned that
> > > > > the resource was unmanaged, and the cluster shutdown was hung.
> > > > > When I directly stopped the VM and then did a 'pcs resource
> > > > > cleanup', the cluster shutdown completed.
> > > > 
> > > > As actions during a cluster shutdown cannot be handled in the
> > > > same transition for each node, I usually add a step to disable
> > > > all resources using the property "stop-all-resources" before
> > > > shutting down the cluster:
> > > > 
> > > >    pcs property set stop-all-resources=true
> > > >    pcs cluster stop --all
> > > > 
> > > > But it seems there's a very new cluster property to handle that
> > > > (IIRC, one or two releases ago). Look at the "shutdown-lock" doc:
> > > > 
> > > >    [...]
> > > >    some users prefer to make resources highly available only for
> > > >    failures, with no recovery for clean shutdowns. If this option
> > > >    is true, resources active on a node when it is cleanly shut
> > > >    down are kept "locked" to that node (not allowed to run
> > > >    elsewhere) until they start again on that node after it
> > > >    rejoins (or for at most shutdown-lock-limit, if set).
> > > >    [...]
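
For reference, enabling that behavior on a Pacemaker version new enough
to have the property would look something like this (the 30-minute
limit below is just an arbitrary example value):

   pcs property set shutdown-lock=true
   pcs property set shutdown-lock-limit=30min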
> > > > 
> > > > [...]
> > > > >    So as best as I can tell, pacemaker really did ask for a
> > > > > migration. Is this the case?
> > > > 
> > > > AFAIK, yes, because each cluster shutdown request is handled
> > > > independently at node level. There's a large door open for all
> > > > kinds of race conditions if requests are handled with some random
> > > > lag on each node.
> > > 
> > > I'm going to guess that's what happened.
> > > 
> > > The basic issue is that there is no "cluster shutdown" in
> > > Pacemaker, only "node shutdown". I'm guessing "pcs cluster stop
> > > --all" sends shutdown requests for each node in sequence (probably
> > > via systemd), and if the nodes are quick enough, one could start
> > > migrating off resources before all the others get their shutdown
> > > request.
> > 
> > Pcs does its best to stop nodes in parallel. The first
> > implementation of this was done back in 2015:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1180506
> > Since then, we moved to using curl for network communication, which
> > also handles parallel cluster stop. Obviously, this doesn't ensure
> > the stop command arrives at and is processed on all nodes at exactly
> > the same time.
> > 
> > Basically, pcs sends a 'stop pacemaker' request to all nodes in
> > parallel and waits for it to finish on all nodes. Then it sends a
> > 'stop corosync' request to all nodes in parallel. The actual
> > stopping on each node is done by 'systemctl stop'.
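
In other words, the sequence is roughly equivalent to the following
sketch (node names taken from the original report; real pcs uses its
own network layer rather than ssh):

   nodes="el8-a01n01 el8-a01n02"

   # stop pacemaker on all nodes in parallel, wait for all to finish
   for n in $nodes; do
       ssh "$n" systemctl stop pacemaker &
   done
   wait

   # only then stop corosync, again on all nodes in parallel
   for n in $nodes; do
       ssh "$n" systemctl stop corosync &
   done
   wait
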
> 
> Hi!
> 
> I wonder: Is there actually a "stop node" command in the
> communication protocol, or does it just kill the crmd remotely?
> In the first case (the command exists), we would only need a grouping
> for multiple commands, and we'd have a cluster shutdown:
> One node sends a group of commands to stop every node. The nodes
> acknowledge and then begin to stop...
> (A "group of commands" is like a single database transaction
> containing multiple changes)
> 
> Regards,
> Ulrich

This is the current sequence of a clean shutdown for one node:

1. Someone or something (e.g. systemctl stop) sends SIGTERM to
pacemakerd on the node to be shut down.

2. pacemakerd relays that signal to all the subdaemons on the node and
waits for them to exit before exiting itself.

3. When the controller gets the SIGTERM, it sends a shutdown request to
the controller on the DC node.

4. When the DC receives the node's shutdown request, it sets a
"shutdown" node attribute for the node and invokes the scheduler, which
schedules all appropriate actions (stopping or moving resources, etc.).

5. The DC coordinates all the necessary actions that were scheduled,
then sends a confirmation to the node that requested shutdown.

6. When the controller receives the confirmation, it exits.
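
One way to watch steps 1 through 4 happen, using the node names from
the original report, is to query the transient "shutdown" attribute
while cleanly stopping Pacemaker on a node:

   # terminal 1: poll the attribute the DC sets in step 4
   watch -n1 "crm_attribute -N el8-a01n01 -n shutdown -G -l reboot"

   # terminal 2: trigger steps 1 and 2 with a normal service stop
   ssh el8-a01n01 systemctl stop pacemaker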

So ...

A "shut down the whole cluster" command should be possible, but the
process would need significant redesign. Currently a node has to
initiate its own shutdown, because the local pacemakerd and controller
have to be aware it's happening.

I envision a new controller API request for cluster shutdown that would
be relayed to all controllers, and each controller would send SIGTERM
to the local pacemakerd. The DC would additionally set the shutdown
attribute for all nodes at once and invoke the scheduler. Timing and
corner cases would require a lot of attention (no DC elected, any node
crashing at any point in the process, etc.).

It should be feasible; someone would just need time to do it.

> > Yes, the nodes which get the request sooner may start migrating
> > resources.
> > 
> > Regards,
> > Tomas
> > 
> > > 
> > > There would be a way around it. Normally Pacemaker is shut down via
> > > SIGTERM to pacemakerd (which is what systemctl stop does), but
> > > inside Pacemaker it's implemented as a special "shutdown" transient
> > > node attribute, set to the epoch timestamp of the request. It would
> > > be possible to set that attribute for all nodes in a copy of the
> > > CIB, then load that into the live cluster.
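
Purely to illustrate what that attribute looks like (this is the
per-node variant, not the single CIB update described above, and it is
not a supported shutdown procedure; node names are from the original
report):

   # epoch timestamp, as Pacemaker itself would set it
   now=$(date +%s)
   for n in el8-a01n01 el8-a01n02; do
       crm_attribute --node "$n" --name shutdown \
                     --update "$now" --lifetime reboot
   done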
> > > 
> > > stop-all-resources as suggested would be another way around it
> > > (and would have to be cleared after start-up, which could be a
> > > plus or a minus depending on how much control vs. convenience you
> > > want).
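
Spelled out, that whole workaround would look roughly like this (the
last step is the clean-up mentioned above):

   pcs property set stop-all-resources=true
   pcs cluster stop --all
   # ... later, once maintenance is done ...
   pcs cluster start --all
   pcs property set stop-all-resources=false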
> > > 
> > 
-- 
Ken Gaillot <kgaillot at redhat.com>


