[ClusterLabs] Antw: Re: Antw: [EXT] Re: Stopping all nodes causes servers to migrate

Ken Gaillot kgaillot at redhat.com
Thu Jan 28 09:56:20 EST 2021


On Thu, 2021-01-28 at 11:12 +0100, Ulrich Windl wrote:
> > > > Ken Gaillot <kgaillot at redhat.com> wrote on 27.01.2021 at
> > > > 18:46 in message
> <02cd90fcc10f1021d9f51649e2991da3209a6935.camel at redhat.com>:
> > On Wed, 2021-01-27 at 08:35 +0100, Ulrich Windl wrote:
> > > > > > Tomas Jelinek <tojeline at redhat.com> wrote on 26.01.2021 at
> > > > > > 16:15 in message
> > > <48f935a5-184f-d2d7-7f1a-db596aa6c72c at redhat.com>:
> > > > On 25. 01. 21 at 17:01, Ken Gaillot wrote:
> > > > > On Mon, 2021‑01‑25 at 09:51 +0100, Jehan‑Guillaume de
> > > > > Rorthais
> > > > > wrote:
> > > > > > Hi Digimer,
> > > > > > 
> > > > > > On Sun, 24 Jan 2021 15:31:22 ‑0500
> > > > > > Digimer <lists at alteeve.ca> wrote:
> > > > > > [...]
> > > > > > >   I had a test server (srv01-test) running on node 1
> > > > > > > (el8-a01n01), and on node 2 (el8-a01n02) I ran
> > > > > > > 'pcs cluster stop --all'.
> > > > > > > 
> > > > > > >    It appears like pacemaker asked the VM to migrate to
> > > > > > > node 2 instead of stopping it. Once the server was on node 2,
> > > > > > > I couldn't use 'pcs resource disable <vm>' as it returned that
> > > > > > > that resource was unmanaged, and the cluster shutdown was
> > > > > > > hung. When I directly stopped the VM and then did a
> > > > > > > 'pcs resource cleanup', the cluster shutdown completed.
> > > > > > 
> > > > > > As actions during a cluster shutdown cannot be handled in the
> > > > > > same transition for each node, I usually add a step to disable
> > > > > > all resources using the property "stop-all-resources" before
> > > > > > shutting down the cluster:
> > > > > > 
> > > > > >    pcs property set stop-all-resources=true
> > > > > >    pcs cluster stop --all
> > > > > > 
> > > > > > But it seems there's a very new cluster property to handle that
> > > > > > (IIRC, one or two releases ago). Look at "shutdown-lock" doc:
> > > > > > 
> > > > > >    [...]
> > > > > >    some users prefer to make resources highly available only
> > > > > >    for failures, with no recovery for clean shutdowns. If this
> > > > > >    option is true, resources active on a node when it is cleanly
> > > > > >    shut down are kept "locked" to that node (not allowed to run
> > > > > >    elsewhere) until they start again on that node after it
> > > > > >    rejoins (or for at most shutdown-lock-limit, if set).
> > > > > >    [...]
> > > > > > 
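> > > > > > For example, something like this should turn it on (just a
> > > > > > sketch; the limit value is only an illustration):
> > > > > > 
> > > > > >    pcs property set shutdown-lock=true
> > > > > >    pcs property set shutdown-lock-limit=10min
> > > > > > 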
> > > > > > [...]
> > > > > > >    So as best as I can tell, pacemaker really did ask for a
> > > > > > > migration. Is this the case?
> > > > > > 
> > > > > > AFAIK, yes, because each cluster shutdown request is handled
> > > > > > independently at node level. There's a large door open for all
> > > > > > kinds of race conditions if requests are handled with some
> > > > > > random lag on each node.
> > > > > 
> > > > > I'm going to guess that's what happened.
> > > > > 
> > > > > The basic issue is that there is no "cluster shutdown" in
> > > > > Pacemaker, only "node shutdown". I'm guessing "pcs cluster stop
> > > > > --all" sends shutdown requests for each node in sequence (probably
> > > > > via systemd), and if the nodes are quick enough, one could start
> > > > > migrating off resources before all the others get their shutdown
> > > > > request.
> > > > 
> > > > Pcs is doing its best to stop nodes in parallel. The first
> > > > implementation of this was done back in 2015:
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=1180506
> > > > Since then, we moved to using curl for network communication, which
> > > > also handles parallel cluster stop. Obviously, this doesn't ensure
> > > > the stop command arrives at and is processed on all nodes at exactly
> > > > the same time.
> > > > 
> > > > Basically, pcs sends a 'stop pacemaker' request to all nodes in
> > > > parallel and waits for it to finish on all nodes. Then it sends a
> > > > 'stop corosync' request to all nodes in parallel. The actual
> > > > stopping on each node is done by 'systemctl stop'.
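> > > > 
> > > > Done by hand, that sequence would look roughly like the sketch
> > > > below (pcs talks to pcsd rather than using ssh, and the node names
> > > > are just placeholders):
> > > > 
> > > >    # stop Pacemaker on all nodes in parallel, then wait
> > > >    for n in node1 node2 node3; do
> > > >        ssh "$n" systemctl stop pacemaker &
> > > >    done; wait
> > > >    # only then stop Corosync on all nodes in parallel
> > > >    for n in node1 node2 node3; do
> > > >        ssh "$n" systemctl stop corosync &
> > > >    done; wait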
> > > 
> > > Hi!
> > > 
> > > I wonder: Is there actually a "stop node" command in the
> > > communication protocol, or does it just kill the crmd remotely?
> > > In the first case (command exists), we would only need a grouping
> > > for multiple commands, and we'd have a cluster shutdown:
> > > One node sends a group of commands to stop every node. The nodes
> > > acknowledge and then begin to stop...
> > > (A "group of commands" is like a single database transaction
> > > containing multiple changes)
> > > 
> > > Regards,
> > > Ulrich
> 
> Hi Ken!
> 
> as I periodically forget: Thanks once again for explaining!
> 
> > This is the current sequence of a clean shutdown for one node:
> > 
> > 1. Someone or something (e.g. systemctl stop) sends SIGTERM to
> > pacemakerd on the node to be shut down.
> > 
> > 2. pacemakerd relays that signal to all the subdaemons on the node
> > and
> > waits for them to exit before exiting itself.
> > 
> > 3. When the controller gets the SIGTERM, it sends a shutdown
> > request to
> > the controller on the DC node.
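> > 
> > For illustration, either of these triggers step 1 on the node being
> > stopped (the second form bypasses systemd and is only a sketch):
> > 
> >    systemctl stop pacemaker
> >    kill -TERM "$(pidof pacemakerd)"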
> 
> I wasn't aware that the DC is a process of its own; I thought it's just
> a role of
> the crmd (seems to be called pacemaker-controld nowadays).

That's correct, it's just a role of the controller process, not a
separate process. Very similar to promotable clones -- the controller
could be considered a clone on every node, with one instance promoted
to DC.

> > 4. When the DC receives the node's shutdown request, it sets a
> > "shutdown" node attribute for the node and invokes the scheduler,
> > which
> > schedules all appropriate actions (stopping or moving resources,
> > etc.).
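> > 
> > (That attribute can be seen from the command line while a node is
> > shutting down; a sketch, with a placeholder node name:
> > 
> >    attrd_updater --query --name shutdown --node node1
> > 
> > A non-empty value is the epoch timestamp of the shutdown request.)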
> 
> But it does not trigger an "election" at that point (when the DC is
> to be shut down), right? Only when the DC leaves the membership is an
> election triggered, right?

Yes, though technically it's the corosync process group rather than
membership. The membership determines quorum, the process group allows
inter-node communication. But it's all corosync and normally the
distinction doesn't matter.

> > 6. The DC coordinates all the necessary actions that were
> > scheduled,
> > then sends a confirmation to the node that requested shutdown.
> 
> That happens _after_ all the resources were migrated or stopped?

Yes

> > 7. When the controller receives the confirmation, it exits.
> > 
> > So ...
> > 
> > A "shut down the whole cluster" command should be possible, but the
> > process would need significant redesign. Currently a node has to
> > initiate its own shutdown, because the local pacemakerd and
> > controller
> > have to be aware it's happening.
> > 
> > I envision a new controller API request for cluster shutdown that
> > would be relayed to all controllers, and each controller would send
> > SIGTERM to the local pacemakerd. The DC would additionally set the
> > shutdown attribute for all nodes at once and invoke the scheduler.
> > Timing and corner cases would require a lot of attention (no DC
> > elected, any node crashing at any point in the process, etc.).
> 
> A tricky part could be that of confirmation: I had implemented a
> syslogd featuring a control protocol that allows remote restart (after
> upgrading the "binary") and shutdown. The command interpreter parses
> the command, executes it, then reports back the result. That is a bit
> hairy for shutdown and reload/restart: the process can't report back
> that it has already shut itself down.
> So the trick was to queue the shutdown/restart commands, confirm (the
> queueing), and then process the queue...
> 
> AFAIK "shutdown confirmation" in the cluster is indirect: You know
> that the node has shut down when it stops responding...
> 
> > 
> > It should be feasible, someone would just need time to do it.
> 
> Sometimes it's like two months of thinking and one day of coding ;-)
> 
> Regards,
> Ulrich
> 
> > 
> > > > Yes, the nodes which get the request sooner may start migrating
> > > > resources.
> > > > 
> > > > Regards,
> > > > Tomas
> > > > 
> > > > > 
> > > > > There would be a way around it. Normally Pacemaker is shut down
> > > > > via SIGTERM to pacemakerd (which is what systemctl stop does),
> > > > > but inside Pacemaker it's implemented as a special "shutdown"
> > > > > transient node attribute, set to the epoch timestamp of the
> > > > > request. It would be possible to set that attribute for all nodes
> > > > > in a copy of the CIB, then load that into the live cluster.
> > > > > 
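> > > > > A rough, untested sketch of that idea with the existing
> > > > > command-line tools (node names are placeholders, and setting the
> > > > > attributes one node at a time is not atomic the way loading a
> > > > > prepared CIB copy would be):
> > > > > 
> > > > >    now=$(date +%s)
> > > > >    for n in node1 node2 node3; do
> > > > >        crm_attribute --node "$n" --name shutdown \
> > > > >            --update "$now" --lifetime reboot
> > > > >    done
> > > > > 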
> > > > > stop-all-resources as suggested would be another way around it
> > > > > (and would have to be cleared after start-up, which could be a
> > > > > plus or a minus depending on how much control vs convenience you
> > > > > want).
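> > > > > 
> > > > > (Clearing it afterwards would be something along the lines of
> > > > > "pcs property set stop-all-resources=false", or removing the
> > > > > property again.)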
> > > > > 
-- 
Ken Gaillot <kgaillot at redhat.com>


