[ClusterLabs] Antw: Re: Antw: [EXT] Coming in Pacemaker 2.0.4: shutdown locks

Thu Feb 27 11:28:07 EST 2020

On Thu, 27 Feb 2020 09:48:23 -0600
Ken Gaillot <kgaillot at redhat.com> wrote:

> On Thu, 2020-02-27 at 15:01 +0100, Jehan-Guillaume de Rorthais wrote:
> > On Thu, 27 Feb 2020 12:24:46 +0100
> > "Ulrich Windl" <Ulrich.Windl at rz.uni-regensburg.de> wrote:
> >   
> > > Jehan-Guillaume de Rorthais <jgdr at dalibo.com> wrote on 27.02.2020
> > > at 11:05 in message <20200227110502.3624cb87 at firost>:
> > > 
> > > [...]  
> > > > What about something like "lock-location=bool" and
> > > 
> > > For "lock-location" I would assume the value is a "location". I
> > > guess you
> > > wanted a "use-lock-location" Boolean value.  
> > 
> > Mh, maybe "lock-current-location" would better reflect what I meant.
> > 
> > The point is to lock the resource on the node currently running it.  
> 
> Though it only applies for a clean node shutdown, so that has to be in
> the name somewhere. The resource isn't locked during normal cluster
> operation (it can move for resource or node failures, load rebalancing,
> etc.).

Well, I was trying to make the new feature a bit wider than just the
narrow shutdown feature.

Speaking about shutdown, what is the status of clean shutdown of the cluster
handled by Pacemaker? Currently, I advise stopping resources gracefully (e.g.
using pcs resource disable [...]) before shutting down each node, either by hand
or using some higher level tool (e.g. pcs cluster stop --all).

Shouldn't this feature be discussed in this context as well?

[...] 
> > > > it would lock the resource location (unique or clones) until the
> > > > operator unlocks it or the "lock-location-timeout" expires, no matter
> > > > what happens to the resource, maintenance mode or not.
> > > > 
> > > > At first glance, it looks to pair nicely with maintenance-mode
> > > > and to avoid resource migration after a node reboot.
> 
> Maintenance mode is useful if you're updating the cluster stack itself
> -- put in maintenance mode, stop the cluster services (leaving the
> managed services still running), update the cluster services, start the
> cluster services again, take out of maintenance mode.
> 
> This is useful if you're rebooting the node for a kernel update (for
> example). Apply the update, reboot the node. The cluster takes care of
> everything else for you (stop the services before shutting down and do
> not recover them until the node comes back).

I'm a bit lost. If a resource doesn't move during maintenance mode,
could you detail a scenario where we should ban it explicitly from other nodes to
secure its current location when getting out of maintenance? Isn't it an excessive
precaution? Is it just to prevent it from moving somewhere else when exiting
maintenance-mode? If the resource has a preferred node, I suppose the location
constraint should take care of this, shouldn't it?

> > > I wonder: How does it differ from a time-limited "ban" (a term that
> > > already exists)? If you ban all resources from running on a specific
> > > node, the resources would be moved away, and when the node boots,
> > > they won't come back.
> 
> It actually is equivalent to this process:
> 
> 1. Determine what resources are active on the node about to be shut
> down.
> 2. For each of those resources, configure a ban (location constraint
> with -INFINITY score) using a rule where node name is not the node
> being shut down.
> 3. Apply the updates and reboot the node. The cluster will stop the
> resources (due to shutdown) and not start them anywhere else (due to
> the bans).

In maintenance mode, the resources would not move either.

> 4. Wait for the node to rejoin and the resources to start on it again,
> then remove all the bans.
> 
> The advantage is automation, and in particular the sysadmin applying
> the updates doesn't need to even know that the host is part of a
> cluster.

Could you elaborate? I suppose the operator still needs to issue a command to
set the shutdown-lock before reboot, right?

Moreover, if shutdown-lock is just a matter of setting ±INFINITY constraints on
nodes, maybe a higher level tool can take care of this?
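
To make that concrete, here is a minimal Python sketch of what such a tool
might generate for step 2 of the process quoted above: one location constraint
per resource, with a -INFINITY rule matching every node whose name is not the
node being shut down. The function name, the constraint/rule/expression ids
and the resource names are made up for illustration, and the list of active
resources (step 1) is assumed to be known already. Loading the fragments into
the CIB and removing them afterwards (step 4) is left out.

```python
import xml.etree.ElementTree as ET


def shutdown_ban_constraints(node, resources):
    """Build one rsc_location element per resource, banning it from every
    node whose name differs from `node`.

    The "ban-<rsc>-not-<node>" ids are a hypothetical naming scheme, not
    something Pacemaker or pcs generates.
    """
    constraints = []
    for rsc in resources:
        cid = f"ban-{rsc}-not-{node}"
        loc = ET.Element("rsc_location", {"id": cid, "rsc": rsc})
        # -INFINITY score: the resource may never run where the rule matches.
        rule = ET.SubElement(
            loc, "rule", {"id": f"{cid}-rule", "score": "-INFINITY"}
        )
        # The rule matches any node whose name is not the one being shut down.
        ET.SubElement(
            rule,
            "expression",
            {
                "id": f"{cid}-expr",
                "attribute": "#uname",
                "operation": "ne",
                "value": node,
            },
        )
        constraints.append(loc)
    return constraints


if __name__ == "__main__":
    # Hypothetical node and resource names.
    for constraint in shutdown_ban_constraints("node1", ["pgsqld", "vip"]):
        print(ET.tostring(constraint, encoding="unicode"))
```

If this is really all there is to it, the remaining work for the tool would be
to find the active resources, push these constraints before the reboot and
drop them once the resources are back on the node.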

> > This is the standby mode.  
> 
> Standby mode will stop all resources on a node, but it doesn't prevent
> recovery elsewhere.

Yes, I was just commenting on Ulrich's description (history context cropped
here).