[ClusterLabs] Antw: Re: Antw: [EXT] Coming in Pacemaker 2.0.4: shutdown locks

Thu Feb 27 12:43:57 EST 2020

On Thu, 27 Feb 2020 11:00:36 -0600
Ken Gaillot <kgaillot at redhat.com> wrote:

> On Thu, 2020-02-27 at 17:28 +0100, Jehan-Guillaume de Rorthais wrote:
> > On Thu, 27 Feb 2020 09:48:23 -0600
> > Ken Gaillot <kgaillot at redhat.com> wrote:
> >   
> > > On Thu, 2020-02-27 at 15:01 +0100, Jehan-Guillaume de Rorthais
> > > wrote:  
> > > > On Thu, 27 Feb 2020 12:24:46 +0100
> > > > "Ulrich Windl" <Ulrich.Windl at rz.uni-regensburg.de> wrote:
> > > >     
> > > > > Jehan-Guillaume de Rorthais <jgdr at dalibo.com> wrote on 27.02.2020
> > > > > at 11:05 in message <20200227110502.3624cb87 at firost>:
> > > > > 
> > > > > [...]    
> > > > > > What about something like "lock-location=bool" and
> > > > > 
> > > > > For "lock-location" I would assume the value is a "location". I
> > > > > guess you
> > > > > wanted a "use-lock-location" Boolean value.    
> > > > 
> > > > Mh, maybe "lock-current-location" would better reflect what I
> > > > meant.
> > > > 
> > > > The point is to lock the resource on the node currently running
> > > > it.    
> > > 
> > > Though it only applies for a clean node shutdown, so that has to be
> > > in the name somewhere. The resource isn't locked during normal cluster
> > > operation (it can move for resource or node failures, load
> > > rebalancing,
> > > etc.).  
> > 
> > Well, I was trying to make the new feature a bit wider than just the
> > narrow shutdown feature.
> > 
> > Speaking about shutdown, what is the status of clean shutdown of the
> > cluster handled by Pacemaker? Currently, I advise stopping resources
> > gracefully (e.g. using pcs resource disable [...]) before shutting down each
> > node, either by hand or using some higher-level tool (e.g. pcs cluster stop
> > --all).
> 
> I'm not sure why that would be necessary. It should be perfectly fine
> to stop pacemaker in any order without disabling resources.

Because resources might move around during the shutdown sequence. That might
not be desirable, as some resource migrations can be heavy, slow, interfere
with the shutdown, etc. I'm pretty sure this has been discussed in the past.

> Start-up is actually more of an issue ... if you start corosync and
> pacemaker on nodes one by one, and you're not quick enough, then once
> quorum is reached, the cluster will fence all the nodes that haven't
> yet come up. So on start-up, it makes sense to start corosync on all
> nodes, which will establish membership and quorum, then start pacemaker
> on all nodes. Obviously that can't be done within pacemaker so that has
> to be done manually or by a higher-level tool.

Indeed.
Or use corosync's wait_for_all quorum option.
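
To make the difference concrete, here is a toy Python model of the start-up
case. It is only my simplified reading, not how corosync or Pacemaker are
implemented: the function name and node names are made up, and "quorate" is
reduced to "more than half of the nodes have been seen", plus "all nodes have
been seen at least once" when wait_for_all is set.

```python
def startup_fencing_targets(all_nodes, seen_so_far, wait_for_all=False):
    """Return the nodes a simplified cluster would fence at start-up.

    Toy model: the partition becomes quorate once more than half of the
    nodes are members; with wait_for_all it first becomes quorate only
    after every node has been seen at least once.
    """
    all_nodes = set(all_nodes)
    seen = set(seen_so_far) & all_nodes
    majority = len(seen) > len(all_nodes) / 2
    quorate = majority and (not wait_for_all or seen == all_nodes)
    # Only a quorate partition may fence; nodes not yet seen are the targets.
    return sorted(all_nodes - seen) if quorate else []


nodes = ["node1", "node2", "node3"]  # hypothetical node names
print(startup_fencing_targets(nodes, ["node1", "node2"]))
print(startup_fencing_targets(nodes, ["node1", "node2"], wait_for_all=True))
```

In this model the first call returns node3 as a fencing target, while the
second returns nothing because quorum is withheld until node3 has shown up.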

> > Shouldn't this feature be discussed in this context as well?
> > 
> > [...]   
> > > > > > it would lock the resource location (unique or clones) until the
> > > > > > operator unlocks it or the "lock-location-timeout" expires, no
> > > > > > matter what happens to the resource, maintenance mode or not.
> > > > > > 
> > > > > > At first look, it seems to pair nicely with maintenance-mode
> > > > > > and avoid resource migration after a node reboot.
> > > 
> > > Maintenance mode is useful if you're updating the cluster stack
> > > itself
> > > -- put in maintenance mode, stop the cluster services (leaving the
> > > managed services still running), update the cluster services, start
> > > the cluster services again, take out of maintenance mode.
> > > 
> > > This is useful if you're rebooting the node for a kernel update
> > > (for example). Apply the update, reboot the node. The cluster takes care
> > > of everything else for you (stop the services before shutting down and
> > > do not recover them until the node comes back).  
> > 
> > I'm a bit lost. If a resource doesn't move during maintenance mode,
> > could you detail a scenario where we should ban it explicitly from
> > other nodes to secure its current location when getting out of maintenance?
> > Isn't it  
> 
> Sorry, I was unclear -- I was contrasting maintenance mode with
> shutdown locks.
> 
> You wouldn't need a ban with maintenance mode. However maintenance mode
> leaves any active resources running. That means the node shouldn't be
> rebooted in maintenance mode, because those resources will not be
> cleanly stopped.
> 
> With shutdown locks, the active resources are cleanly stopped. That
> does require a ban of some sort because otherwise the resources will be
> recovered on another node.

OK, thanks.

> > excessive precaution? Is it just to avoid it moving somewhere else when
> > exiting maintenance-mode? If the resource has a preferred node, I suppose
> > the location constraint should take care of this, shouldn't it?
> 
> Having a preferred node doesn't prevent the resource from starting
> elsewhere if the preferred node is down (or in standby, or otherwise
> ineligible to run the resource). Even a +INFINITY constraint allows
> recovery elsewhere if the node is not available. To keep a resource
> from being recovered, you have to put a ban (-INFINITY location
> constraint) on any nodes that could otherwise run it.

I was referring to the location constraint *when exiting maintenance mode*, so
the node should be around.
But then, I realize you would have to put the cluster in maintenance mode, put
the node in standby, reboot, unstandby and leave maintenance mode. That is quite
a procedure just for a node restart. I understand the need for a feature to make
it easier, and using a ban is surely cleaner.
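
To check my understanding of your point about +INFINITY versus bans, here is a
small Python sketch. It is a deliberately simplified score model of my own, not
Pacemaker's actual placement algorithm; the function name and node names are
hypothetical.

```python
INFINITY = float("inf")


def place(resource_scores, online_nodes):
    """Pick the node for a resource in a simplified score model.

    resource_scores maps node name -> location score (default 0).
    A node with -INFINITY is banned; offline nodes are never eligible.
    Returns None when no node is allowed, i.e. the resource stays stopped.
    """
    candidates = {
        node: resource_scores.get(node, 0)
        for node in online_nodes
        if resource_scores.get(node, 0) != -INFINITY
    }
    if not candidates:
        return None
    # Highest score wins; names are sorted first so ties are deterministic.
    return max(sorted(candidates), key=candidates.get)


online = ["node2", "node3"]  # node1 is rebooting
# Preference only: the resource is recovered on another node.
print(place({"node1": INFINITY}, online))
# Preference plus bans on the other nodes: the resource stays stopped.
print(place({"node1": INFINITY, "node2": -INFINITY, "node3": -INFINITY}, online))
```

With only the +INFINITY preference the resource lands on node2 while node1 is
away; with the bans in place nothing is eligible and it waits for node1.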

> > > > > I wonder: Where is it different from a time-limited "ban"
> > > > > (wording also exists already)? If you ban all resources from running
> > > > > on a specific node, resources would be moved away, and when booting
> > > > > the node, resources won't come back.    
> > > 
> > > It actually is equivalent to this process:
> > > 
> > > 1. Determine what resources are active on the node about to be shut
> > > down.
> > > 2. For each of those resources, configure a ban (location
> > > constraint
> > > with -INFINITY score) using a rule where node name is not the node
> > > being shut down.
> > > 3. Apply the updates and reboot the node. The cluster will stop the
> > > resources (due to shutdown) and not start them anywhere else (due
> > > to the bans).  
> > 
> > In maintenance mode, this would not move either.  
> 
> The problem with maintenance mode for this scenario is that the reboot
> would uncleanly terminate any active resources.

Right, so you have to disable them by hand, which is not really handy.

> > > 4. Wait for the node to rejoin and the resources to start on it
> > > again, then remove all the bans.
> > > 
> > > The advantage is automation, and in particular the sysadmin applying
> > > the updates doesn't need to even know that the host is part of a
> > > cluster.  
> > 
> > Could you elaborate? I suppose the operator still needs to issue a command
> > to set the shutdown-lock before reboot, don't they?
> 
> Ah, no -- this is intended as a permanent cluster configuration
> setting, always in effect.

Oh. OK. I have a better understanding now! Thanks for your patience :)
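
For the archives, this is how I now picture the four manual steps you listed,
as a rough Python sketch. It only manipulates plain dictionaries and does not
talk to a cluster; the resource names (db, vip, web), node names and function
names are invented for the illustration, and the rule tuple merely mimics a
"-INFINITY where #uname is not the node" location rule.

```python
def bans_for_shutdown(active, node):
    """Steps 1 and 2: build one ban per resource active on `node`.

    `active` maps resource name -> node it currently runs on.
    Each ban mimics a -INFINITY location rule matching every node
    whose name is not `node`.
    """
    return [
        {"resource": rsc, "score": "-INFINITY", "rule": ("#uname", "ne", node)}
        for rsc, where in sorted(active.items())
        if where == node
    ]


def allowed_nodes(resource, bans, online_nodes):
    """Nodes where `resource` may run, given the bans and who is online."""
    allowed = set(online_nodes)
    for ban in bans:
        if ban["resource"] == resource:
            # The rule bans every node except the one named in it.
            _, _, keep = ban["rule"]
            allowed &= {keep}
    return sorted(allowed)


active = {"db": "node1", "vip": "node1", "web": "node2"}
bans = bans_for_shutdown(active, "node1")                      # steps 1-2
print([b["resource"] for b in bans])
print(allowed_nodes("db", bans, ["node2", "node3"]))           # step 3: node1 down
print(allowed_nodes("db", bans, ["node1", "node2", "node3"]))  # step 4: node1 back
bans = []                                                      # step 4: drop the bans
```

While node1 is down, db has no allowed node and stays stopped; once node1
rejoins it is the only allowed node, and the bans can then be removed.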

Note I still agree a per-resource attribute seems better than a cluster-wide
one.

I wonder if this could be fine-tuned by the cluster admins, who actually know
what to do for each resource on node startup/shutdown. E.g. special date
specifications "on-startup" and "on-shutdown" might be useful to apply some
rules on resources equivalent to the shutdown-lock feature you are describing,
wouldn't they? And it seems like a wider feature.

> > Moreover, if shutdown-lock is just a matter of setting ±INFINITY
> > constraints on nodes, maybe a higher-level tool can take care of this?
> 
> In this case, the operator applying the reboot may not even know what
> pacemaker is, much less what command to run. The goal is to fully
> automate the process so a cluster-aware administrator does not need to
> be present.
> 
> I did consider a number of alternative approaches, but they all had
> problematic corner cases. For a higher-level tool or anything external
> to pacemaker, one such corner case is a "time-of-check/time-of-use"
> problem -- determining the list of active resources has to be done
> separately from configuring the bans, and it's possible the list could
> change in the meantime.

It's quite scary if the list changes in the meantime. If an operator who is
unaware of the cluster is shutting down the node while someone else is doing
some setup, there's a problem somewhere in the company.

But this is not our concern anyway :)
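
Still, to be fair to the time-of-check/time-of-use argument, the race is easy
to show with a toy Python model. This is only a sketch with invented resource
and node names and hypothetical helper functions: an external tool lists the
active resources, the cluster state changes, and the bans are then built from
the stale list.

```python
def resources_on(active, node):
    """Resources currently running on `node` (the 'check')."""
    return {rsc for rsc, where in active.items() if where == node}


def unprotected_after_race(before, after, node):
    """Resources on `node` that a stale snapshot would leave without a ban.

    `before` is the state when the tool listed resources, `after` is the
    state when it finally configures the bans (the 'use').
    """
    snapshot = resources_on(before, node)   # time of check
    banned = set(snapshot)                  # time of use, from the stale list
    return sorted(resources_on(after, node) - banned)


before = {"db": "node1", "web": "node2"}
after = {"db": "node1", "web": "node1"}     # web moved to node1 in between
print(unprotected_after_race(before, after, "node1"))
```

Here web ends up on node1 without a ban, so it would be recovered elsewhere
during the reboot. The change does not require a second operator either, since
the cluster can move a resource on its own after a failure.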