[ClusterLabs] Antw: [EXT] Re: Stopping all nodes causes servers to migrate

Wed Jan 27 02:35:01 EST 2021

>>> Tomas Jelinek <tojeline at redhat.com> wrote on 26.01.2021 at 16:15 in message
<48f935a5-184f-d2d7-7f1a-db596aa6c72c at redhat.com>:
> On 25. 01. 21 at 17:01, Ken Gaillot wrote:
>> On Mon, 2021-01-25 at 09:51 +0100, Jehan-Guillaume de Rorthais wrote:
>>> Hi Digimer,
>>>
>>> On Sun, 24 Jan 2021 15:31:22 -0500
>>> Digimer <lists at alteeve.ca> wrote:
>>> [...]
>>>>   I had a test server (srv01-test) running on node 1 (el8-a01n01), and on
>>>> node 2 (el8-a01n02) I ran 'pcs cluster stop --all'.
>>>>
>>>>    It appears that pacemaker asked the VM to migrate to node 2 instead of
>>>> stopping it. Once the server was on node 2, I couldn't use 'pcs resource
>>>> disable <vm>' as it returned that the resource was unmanaged, and the
>>>> cluster shutdown was hung. When I directly stopped the VM and then did
>>>> a 'pcs resource cleanup', the cluster shutdown completed.
>>>
>>> As actions during a cluster shutdown cannot be handled in the same
>>> transition for each node, I usually add a step to disable all resources
>>> using the property "stop-all-resources" before shutting down the cluster:
>>>
>>>    pcs property set stop-all-resources=true
>>>    pcs cluster stop --all
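
The two commands above can be wrapped into one step. This is a minimal sketch, not a tested procedure: the function name and the PCS override variable are hypothetical, introduced only so the sequence can be dry-run without touching a cluster.

```shell
# Sketch only. PCS is a hypothetical override hook (e.g. PCS=echo for a
# dry run); by default the real pcs binary is invoked.
PCS=${PCS:-pcs}

stop_cluster_without_migration() {
    # Stop every resource cluster-wide first, so no node has anything
    # left to migrate when its own shutdown request arrives.
    $PCS property set stop-all-resources=true || return 1
    # Then stop pacemaker and corosync on all nodes.
    $PCS cluster stop --all
}
```

Running `PCS=echo; stop_cluster_without_migration` prints the two commands in order without executing them. Note that the property stays set in the CIB, so resources will not start again until it is cleared.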
>>>
>>> But it seems there's a very new cluster property to handle that (IIRC,
>>> added one or two releases ago). Look at the "shutdown-lock" doc:
>>>
>>>    [...]
>>>    some users prefer to make resources highly available only for failures,
>>>    with no recovery for clean shutdowns. If this option is true, resources
>>>    active on a node when it is cleanly shut down are kept "locked" to that
>>>    node (not allowed to run elsewhere) until they start again on that node
>>>    after it rejoins (or for at most shutdown-lock-limit, if set).
>>>    [...]
>>>
>>> [...]
>>>>    So as best as I can tell, pacemaker really did ask for a migration. Is
>>>> this the case?
>>>
>>> AFAIK, yes, because each cluster shutdown request is handled independently
>>> at node level. There's a large door open for all kinds of race conditions
>>> if requests are handled with some random lag on each node.
>> 
>> I'm going to guess that's what happened.
>> 
>> The basic issue is that there is no "cluster shutdown" in Pacemaker,
>> only "node shutdown". I'm guessing "pcs cluster stop --all" sends
>> shutdown requests for each node in sequence (probably via systemd), and
>> if the nodes are quick enough, one could start migrating off resources
>> before all the others get their shutdown request.
> 
> Pcs is doing its best to stop nodes in parallel. The first 
> implementation of this was done back in 2015:
> https://bugzilla.redhat.com/show_bug.cgi?id=1180506 
> Since then, we moved to using curl for network communication, which also 
> handles parallel cluster stop. Obviously, this doesn't ensure the stop 
> command arrives at and is processed on all nodes at exactly the same time.
> 
> Basically, pcs sends a 'stop pacemaker' request to all nodes in parallel 
> and waits for it to finish on all nodes. Then it sends a 'stop corosync' 
> request to all nodes in parallel. The actual stopping on each node is 
> done by 'systemctl stop'.
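
The two-phase sequence described above can be illustrated roughly as follows. This is not how pcs is implemented (pcs sends the requests through its own network layer); the sketch assumes plain ssh access to each node, and the REMOTE variable and both function names are hypothetical.

```shell
# Illustration only, assuming ssh access to every node.
# REMOTE is a hypothetical transport hook; override it for a dry run.
REMOTE=${REMOTE:-ssh}

stop_service_everywhere() {
    service=$1
    shift
    for node in "$@"; do
        # Fire the request at each node without waiting for the others.
        $REMOTE "$node" systemctl stop "$service" &
    done
    # Block until every node has finished this phase.
    # (Exit statuses of the individual requests are not collected here.)
    wait
}

stop_cluster() {
    # Phase 1: pacemaker on all nodes; phase 2: corosync on all nodes.
    stop_service_everywhere pacemaker "$@"
    stop_service_everywhere corosync "$@"
}
```

With the nodes from this thread, the call would be `stop_cluster el8-a01n01 el8-a01n02`. Even here the requests within a phase only start at roughly the same time, so a node that receives its request first can still begin migrating resources.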

Hi!

I wonder: Is there actually a "stop node" command in the communication
protocol, or does it just kill the crmd remotely?
In the first case (the command exists), we would only need a grouping for
multiple commands, and we'd have a cluster shutdown:
One node sends a group of commands to stop every node. The nodes acknowledge
and then begin to stop...
(A "group of commands" is like a single database transaction containing
multiple changes.)

Regards,
Ulrich

> 
> Yes, the nodes which get the request sooner may start migrating resources.
> 
> Regards,
> Tomas
> 
>> 
>> There would be a way around it. Normally Pacemaker is shut down via
>> SIGTERM to pacemakerd (which is what systemctl stop does), but inside
>> Pacemaker it's implemented as a special "shutdown" transient node
>> attribute, set to the epoch timestamp of the request. It would be
>> possible to set that attribute for all nodes in a copy of the CIB, then
>> load that into the live cluster.
>> 
>> stop-all-resources as suggested would be another way around it (and
>> would have to be cleared after start‑up, which could be a plus or a
>> minus depending on how much control vs convenience you want).
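
Clearing the property after start-up could look like the sketch below. It assumes stop-all-resources was set to true before the shutdown; the function name and the PCS override variable are hypothetical.

```shell
# Sketch only. PCS is a hypothetical override hook (e.g. PCS=echo for a
# dry run); by default the real pcs binary is invoked.
PCS=${PCS:-pcs}

start_cluster_and_release_resources() {
    # Start all nodes and wait for them to come up.
    $PCS cluster start --all --wait || return 1
    # Resources stay stopped until the property is cleared again.
    $PCS property set stop-all-resources=false
}
```

Keeping the second command as a separate manual step instead is the "control" side of the trade-off mentioned above: nothing starts until an operator decides it should.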
>> 