[ClusterLabs] Stopping all nodes causes servers to migrate

Tue Jan 26 10:15:55 EST 2021

On 25. 01. 21 at 17:01, Ken Gaillot wrote:
> On Mon, 2021-01-25 at 09:51 +0100, Jehan-Guillaume de Rorthais wrote:
>> Hi Digimer,
>>
>> On Sun, 24 Jan 2021 15:31:22 -0500
>> Digimer <lists at alteeve.ca> wrote:
>> [...]
>>>   I had a test server (srv01-test) running on node 1 (el8-a01n01),
>>> and on node 2 (el8-a01n02) I ran 'pcs cluster stop --all'.
>>>
>>>    It appears like pacemaker asked the VM to migrate to node 2
>>> instead of stopping it. Once the server was on node 2, I couldn't use
>>> 'pcs resource disable <vm>' as it returned that that resource was
>>> unmanaged, and the cluster shut down was hung. When I directly stopped
>>> the VM and then did a 'pcs resource cleanup', the cluster shutdown
>>> completed.
>>
>> As actions during a cluster shutdown cannot be handled in the same
>> transition for each node, I usually add a step to disable all
>> resources using the "stop-all-resources" property before shutting
>> down the cluster:
>>
>>    pcs property set stop-all-resources=true
>>    pcs cluster stop --all
>>
>> But it seems there's a very new cluster property to handle that
>> (IIRC, one or two releases ago). Look at the "shutdown-lock" doc:
>>
>>    [...]
>>    some users prefer to make resources highly available only for
>>    failures, with no recovery for clean shutdowns. If this option is
>>    true, resources active on a node when it is cleanly shut down are
>>    kept "locked" to that node (not allowed to run elsewhere) until
>>    they start again on that node after it rejoins (or for at most
>>    shutdown-lock-limit, if set).
>>    [...]
>>
>> [...]
>>>    So as best as I can tell, pacemaker really did ask for a
>>> migration. Is this the case?
>>
>> AFAIK, yes, because each cluster shutdown request is handled
>> independently at node level. There's a large door open for all kinds
>> of race conditions if requests are handled with some random lags on
>> each node.
> 
> I'm going to guess that's what happened.
> 
> The basic issue is that there is no "cluster shutdown" in Pacemaker,
> only "node shutdown". I'm guessing "pcs cluster stop --all" sends
> shutdown requests for each node in sequence (probably via systemd), and
> if the nodes are quick enough, one could start migrating off resources
> before all the others get their shutdown request.

Pcs is doing its best to stop nodes in parallel. The first 
implementation of this was done back in 2015:
https://bugzilla.redhat.com/show_bug.cgi?id=1180506
Since then, we moved to using curl for network communication, which also
handles parallel cluster stop. Obviously, this doesn't ensure the stop
command arrives at and is processed on all nodes at exactly the same time.

Basically, pcs sends a 'stop pacemaker' request to all nodes in parallel
and waits for it to finish on all nodes. Then it sends a 'stop corosync'
request to all nodes in parallel. The actual stopping on each node is
done by 'systemctl stop'.

Yes, the nodes which get the request sooner may start migrating resources.
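
To illustrate the two-phase sequence described above, here is a rough
shell sketch. It is not how pcs is implemented: pcs sends its requests
over the network to each node, while this sketch assumes plain ssh
access. The helper names run_on_node, run_on_all and stop_cluster are
hypothetical and exist only for this example.

```shell
#!/bin/sh
# Sketch of a two-phase parallel cluster stop.
# Assumption: nodes are reachable over ssh; run_on_node is a
# hypothetical helper, not part of pcs.
run_on_node() {
    node=$1; shift
    ssh "$node" "$@"
}

# Run one command on every listed node in parallel, then wait for all
# of them. Returns non-zero if the command failed on any node.
run_on_all() {
    cmd=$1; shift
    pids=""
    for node in "$@"; do
        # $cmd is intentionally unquoted so it splits into words.
        run_on_node "$node" $cmd &
        pids="$pids $!"
    done
    rc=0
    for pid in $pids; do
        wait "$pid" || rc=1
    done
    return $rc
}

stop_cluster() {
    # Phase 1: stop Pacemaker everywhere. The requests start in
    # parallel but still reach the nodes at slightly different times,
    # which is the window in which a resource can be migrated.
    run_on_all "systemctl stop pacemaker" "$@" || return 1
    # Phase 2: only after Pacemaker is down on all nodes, stop Corosync.
    run_on_all "systemctl stop corosync" "$@"
}
```

It could be called as 'stop_cluster el8-a01n01 el8-a01n02'. Starting
the requests in parallel narrows the window between nodes but does not
close it.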

Regards,
Tomas

> 
> There would be a way around it. Normally Pacemaker is shut down via
> SIGTERM to pacemakerd (which is what systemctl stop does), but inside
> Pacemaker it's implemented as a special "shutdown" transient node
> attribute, set to the epoch timestamp of the request. It would be
> possible to set that attribute for all nodes in a copy of the CIB, then
> load that into the live cluster.
> 
> stop-all-resources as suggested would be another way around it (and
> would have to be cleared after start-up, which could be a plus or a
> minus depending on how much control vs convenience you want).
>
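
For reference, the stop-all-resources approach discussed above could be
wrapped roughly as follows. This is only a sketch: the function names
and the PCS override variable are made up for the example, and it
clears the property by setting it back to false (its default) once the
cluster is up again.

```shell
#!/bin/sh
# Sketch: stop and start the whole cluster without migrations.
# PCS can be overridden (e.g. PCS=echo) for a dry run; this variable
# is an assumption of the sketch, not a pcs feature.
PCS=${PCS:-pcs}

cluster_stop_all() {
    # Tell Pacemaker to stop every resource first, so a node that gets
    # its shutdown request early has nothing to migrate to a peer.
    $PCS property set stop-all-resources=true || return 1
    $PCS cluster stop --all
}

cluster_start_all() {
    $PCS cluster start --all --wait || return 1
    # The property is stored in the CIB and survives the restart, so
    # resources stay stopped until it is set back to false.
    $PCS property set stop-all-resources=false
}
```

Leaving the last step out of cluster_start_all gives the "more control"
variant: the cluster comes up with everything stopped until the
property is changed by hand.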