[ClusterLabs] Antw: Re: Antw: [EXT] Re: Stopping all nodes causes servers to migrate

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Thu Jan 28 05:12:54 EST 2021


>>> Ken Gaillot <kgaillot at redhat.com> wrote on 27.01.2021 at 18:46 in
message
<02cd90fcc10f1021d9f51649e2991da3209a6935.camel at redhat.com>:
> On Wed, 2021-01-27 at 08:35 +0100, Ulrich Windl wrote:
>> >>> Tomas Jelinek <tojeline at redhat.com> wrote on 26.01.2021 at 16:15
>> in message
>> <48f935a5-184f-d2d7-7f1a-db596aa6c72c at redhat.com>:
>> > On 25. 01. 21 at 17:01, Ken Gaillot wrote:
>> > > On Mon, 2021-01-25 at 09:51 +0100, Jehan-Guillaume de Rorthais wrote:
>> > > > Hi Digimer,
>> > > > 
>> > > > On Sun, 24 Jan 2021 15:31:22 ‑0500
>> > > > Digimer <lists at alteeve.ca> wrote:
>> > > > [...]
>> > > > >   I had a test server (srv01-test) running on node 1
>> > > > > (el8-a01n01), and on node 2 (el8-a01n02) I ran
>> > > > > 'pcs cluster stop --all'.
>> > > > > 
>> > > > >    It appears like pacemaker asked the VM to migrate to node 2
>> > > > > instead of stopping it. Once the server was on node 2, I couldn't
>> > > > > use 'pcs resource disable <vm>' as it returned that that resource
>> > > > > was unmanaged, and the cluster shutdown was hung. When I directly
>> > > > > stopped the VM and then did a 'pcs resource cleanup', the cluster
>> > > > > shutdown completed.
>> > > > 
>> > > > As actions during a cluster shutdown cannot be handled in the same
>> > > > transition for each node, I usually add a step to disable all
>> > > > resources using the property "stop-all-resources" before shutting
>> > > > down the cluster:
>> > > > 
>> > > >    pcs property set stop-all-resources=true
>> > > >    pcs cluster stop --all
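
(As Ken notes further down, stop-all-resources stays set in the CIB across the
restart, so after bringing the cluster back up it has to be cleared again;
roughly:

   pcs cluster start --all
   pcs property set stop-all-resources=false

otherwise no resources will be started.)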
>> > > > 
>> > > > But it seems there's a very new cluster property to handle that
>> > > > (IIRC, one or two releases ago). Look at "shutdown-lock" doc:
>> > > > 
>> > > >    [...]
>> > > >    some users prefer to make resources highly available only for
>> > > >    failures, with no recovery for clean shutdowns. If this option
>> > > >    is true, resources active on a node when it is cleanly shut down
>> > > >    are kept "locked" to that node (not allowed to run elsewhere)
>> > > >    until they start again on that node after it rejoins (or for at
>> > > >    most shutdown-lock-limit, if set).
>> > > >    [...]
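
(For the record, shutdown-lock was added in Pacemaker 2.0.4, so only rather
recent clusters have it. If yours does, enabling it before a full stop should
look roughly like this (the 30min limit is just an example value):

   pcs property set shutdown-lock=true
   pcs property set shutdown-lock-limit=30min
   pcs cluster stop --all

I haven't tried it myself, so check your pacemaker/pcs versions first.)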
>> > > > 
>> > > > [...]
>> > > > >    So as best as I can tell, pacemaker really did ask for a
>> > > > > migration. Is this the case?
>> > > > 
>> > > > AFAIK, yes, because each cluster shutdown request is handled
>> > > > independently at node level. There's a large door open for all
>> > > > kinds of race conditions if requests are handled with some random
>> > > > lag on each node.
>> > > 
>> > > I'm going to guess that's what happened.
>> > > 
>> > > The basic issue is that there is no "cluster shutdown" in Pacemaker,
>> > > only "node shutdown". I'm guessing "pcs cluster stop --all" sends
>> > > shutdown requests for each node in sequence (probably via systemd),
>> > > and if the nodes are quick enough, one could start migrating off
>> > > resources before all the others get their shutdown request.
>> > 
>> > Pcs is doing its best to stop nodes in parallel. The first
>> > implementation of this was done back in 2015:
>> > https://bugzilla.redhat.com/show_bug.cgi?id=1180506 
>> > Since then, we moved to using curl for network communication, which
>> > also handles parallel cluster stop. Obviously, this doesn't ensure
>> > that the stop command arrives at and is processed on all nodes at
>> > exactly the same time.
>> > 
>> > Basically, pcs sends a 'stop pacemaker' request to all nodes in
>> > parallel and waits for it to finish on all nodes. Then it sends a
>> > 'stop corosync' request to all nodes in parallel. The actual stopping
>> > on each node is done by 'systemctl stop'.
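
(So, done by hand, the per-node equivalent would be roughly:

   systemctl stop pacemaker    # on every node, as close to parallel as you get
   systemctl stop corosync     # once pacemaker is down everywhere

which also shows why one node may already be migrating resources away while
another has not even received its stop request yet.)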
>> 
>> Hi!
>> 
>> I wonder: Is there actually a "stop node" command in the communication
>> protocol, or does it just kill the crmd remotely?
>> In the first case (the command exists), we would only need a grouping
>> for multiple commands, and we'd have a cluster shutdown:
>> One node sends a group of commands to stop every node. The nodes
>> acknowledge and then begin to stop...
>> (A "group of commands" is like a single database transaction containing
>> multiple changes.)
>> 
>> Regards,
>> Ulrich
> 

Hi Ken!

As I periodically forget how this works: thanks once again for explaining!

> This is the current sequence of a clean shutdown for one node:
> 
> 1. Someone or something (e.g. systemctl stop) sends SIGTERM to
> pacemakerd on the node to be shut down.
> 
> 2. pacemakerd relays that signal to all the subdaemons on the node and
> waits for them to exit before exiting itself.
> 
> 3. When the controller gets the SIGTERM, it sends a shutdown request to
> the controller on the DC node.

I wasn't aware that the DC is a process of its own; I thought it was just a
role of the crmd (which seems to be called pacemaker-controld nowadays).

> 
> 4. When the DC receives the node's shutdown request, it sets a
> "shutdown" node attribute for the node and invokes the scheduler, which
> schedules all appropriate actions (stopping or moving resources, etc.).

But it does not trigger an "election" at that point (when the DC itself is to
be shut down), right? An election is only triggered once the DC has left the
membership, right?
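
Anyway, if I understand step 4 correctly, that "shutdown" attribute is an
ordinary transient node attribute, so while a shutdown is pending one should
be able to peek at it with something like (node name taken from Digimer's
example):

   crm_attribute --node el8-a01n01 --name shutdown --lifetime reboot --query

and it should show the epoch timestamp of the request, as you mention further
down.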

> 
> 6. The DC coordinates all the necessary actions that were scheduled,
> then sends a confirmation to the node that requested shutdown.

That happens _after_ all the resources have been migrated or stopped?

> 
> 7. When the controller receives the confirmation, it exits.
> 
> So ...
> 
> A "shut down the whole cluster" command should be possible, but the
> process would need significant redesign. Currently a node has to
> initiate its own shutdown, because the local pacemakerd and controller
> have to be aware it's happening.
> 
> I envision a new controller API request for cluster shutdown that would
> be relayed to all controllers, and each controller would send SIGTERM
> to the local pacemakerd. The DC would additionally set the shutdown
> attribute for all nodes at once and invoke the scheduler. Timing and
> corner cases would require a lot of attention (no DC elected, any node
> crashing at any point in the process, etc.).

A tricky part could be the confirmation: I once implemented a syslogd
featuring a control protocol that allows remote restart (after upgrading the
"binary") and shutdown. The command interpreter parses the command, executes
it, then reports back the result. That is a bit hairy for shutdown and
reload/restart: the process can't report back a result once it has shut itself
down.
So I cheated: queue the shutdown/restart command, confirm (the queueing), and
only then process the queue...

AFAIK "shutdown confirmation" in the cluster is indirect: You know that the
node had shut down wehn it stops responding...

> 
> It should be feasible, someone would just need time to do it.

Sometimes it's like two months of thinking and one day of coding ;-)

Regards,
Ulrich

> 
>> > Yes, the nodes which get the request sooner may start migrating
>> > resources.
>> > 
>> > Regards,
>> > Tomas
>> > 
>> > > 
>> > > There would be a way around it. Normally Pacemaker is shut down via
>> > > SIGTERM to pacemakerd (which is what systemctl stop does), but inside
>> > > Pacemaker it's implemented as a special "shutdown" transient node
>> > > attribute, set to the epoch timestamp of the request. It would be
>> > > possible to set that attribute for all nodes in a copy of the CIB,
>> > > then load that into the live cluster.
>> > > 
>> > > stop-all-resources as suggested would be another way around it (and
>> > > would have to be cleared after start-up, which could be a plus or a
>> > > minus depending on how much control vs convenience you want).
>> > > 
>> > 
> -- 
> Ken Gaillot <kgaillot at redhat.com>
> 




