[ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Coming in Pacemaker 2.0.4: shutdown locks
Ken Gaillot
kgaillot at redhat.com
Fri Feb 28 10:28:53 EST 2020
On Fri, 2020-02-28 at 09:37 +0100, Ulrich Windl wrote:
> > > > Ken Gaillot <kgaillot at redhat.com> wrote on 27.02.2020 at
> > > > 23:43 in message
>
> <43512a11c2ddffbabeee11cf4cb509e4e5dc98ca.camel at redhat.com>:
>
> [...]
> >
> > > 2. Resources/groups are stopped (target-role=stopped)
> > > 3. The node exits the cluster cleanly once no resources are
> > > running any more
> > > 4. The node rejoins the cluster after the reboot
> > > 5. Location constraints are created for the resources marked in
> > > step 1: a positive one on the rebooted node and negative ones
> > > (bans) on the rest of the nodes
> > > 6. target-role is set back to started and the resources come
> > > back up
> > > 7. When each resource group (or standalone resource) is back
> > > online, the mark from step 1 is removed, along with any location
> > > constraints (cli-ban & cli-prefer) for the resource/group.
> >
> > Exactly, that's effectively what happens.
>
> May I ask how robust the mechanism will be?
> For example, with a "resource restart" there are two target roles
> (each made persistent): stopped and started. If the node performing
> the operation is fenced (we have had that a few times), the
> resources may remain "stopped" until started manually again.
> I see a similar risk with this mechanism.
Corner cases were carefully considered with this one. If a node is
fenced, its entire CIB status section is cleared, which will include
shutdown locks. I considered alternative implementations under the
hood, and the main advantage of the one chosen is that setting and
clearing the lock are atomic with recording the action results that
cause them. That eliminates a whole lot of possibilities for the type
of problem you mention. There are also multiple backstops that clear
locks if anything looks fishy: the node being unclean, the resource
somehow having started elsewhere while the lock was in effect, a
locked resource being removed from the configuration while it is
down, and so on.
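
If you want to poke at it, the lock is recorded in the status
section, so a plain CIB query will show it. A rough sketch (the exact
attribute name/format is from memory, so check the 2.0.4 docs rather
than trusting this verbatim):

  # dump the status section, where action results and shutdown locks live
  cibadmin --query --scope status

  # from memory, a locked resource's history entry gets something like
  # shutdown-lock="<epoch timestamp>" on its lrm_resource element, and
  # "clearing the lock" means that attribute going away
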
The one area I don't consider mature yet is Pacemaker Remote nodes. I'd
recommend using the feature only in a cluster without them. This is due
mainly to a (documented) limitation that manual lock clearing and
shutdown-lock-limit only work if the remote connection is disabled
after stopping the node, which sort of defeats the "hands off" goal.
I also think that using locks with remote nodes needs more testing.
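
For reference, on cluster nodes the setup and the manual escape hatch
look roughly like this ("rsc1" and "node1" are placeholder names):

  # enable shutdown locks and cap how long a lock is honored
  crm_attribute --type crm_config --name shutdown-lock --update true
  crm_attribute --type crm_config --name shutdown-lock-limit --update 30min

  # manually clear the lock for one resource on the down node
  # (both the resource and the node must be specified)
  crm_resource --resource rsc1 --refresh --node node1
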
>
> [...]
--
Ken Gaillot <kgaillot at redhat.com>