[ClusterLabs] Antw: Re: Antw: Re: questions about startup fencing

Ken Gaillot kgaillot at redhat.com
Tue Dec 5 17:31:11 UTC 2017


On Tue, 2017-12-05 at 17:43 +0100, Jehan-Guillaume de Rorthais wrote:
> On Tue, 05 Dec 2017 08:59:55 -0600
> Ken Gaillot <kgaillot at redhat.com> wrote:
> 
> > On Tue, 2017-12-05 at 14:47 +0100, Ulrich Windl wrote:
> > > > > > Tomas Jelinek <tojeline at redhat.com> wrote on 04.12.2017 at
> > > > > > 16:50 in message
> > > <3e60579c-0f4d-1c32-70fc-d207e0654fbf at redhat.com>:
> > > > On 4.12.2017 at 14:21, Jehan-Guillaume de Rorthais wrote:
> > > > > On Mon, 4 Dec 2017 12:31:06 +0100
> > > > > Tomas Jelinek <tojeline at redhat.com> wrote:
> > > > > 
> > > > > > On 4.12.2017 at 10:36, Jehan-Guillaume de Rorthais wrote:
> > > > > > > On Fri, 01 Dec 2017 16:34:08 -0600
> > > > > > > Ken Gaillot <kgaillot at redhat.com> wrote:
> > > > > > >    
> > > > > > > > On Thu, 2017-11-30 at 07:55 +0100, Ulrich Windl wrote:
> > > > > > > > > 
> > > > > > > > >       
> > > > > > > > > > Kristoffer Gronlund <kgronlund at suse.com> wrote:
> > > > > > > > > > > Adam Spiers <aspiers at suse.com> writes:
> > > > > > > > > > >       
> > > > > > > > > > > > - The whole cluster is shut down cleanly.
> > > > > > > > > > > > 
> > > > > > > > > > > > - The whole cluster is then started up again.  (Side
> > > > > > > > > > > >   question: what happens if the last node to shut
> > > > > > > > > > > >   down is not the first to start up?  How will the
> > > > > > > > > > > >   cluster ensure it has the most recent version of
> > > > > > > > > > > >   the CIB?  Without that, how would it know whether
> > > > > > > > > > > >   the last man standing was shut down cleanly or
> > > > > > > > > > > >   not?)
> > > > > > > > > > > 
> > > > > > > > > > > This is my opinion, I don't really know what the
> > > > > > > > > > > "official" pacemaker stance is: There is no such thing
> > > > > > > > > > > as shutting down a cluster cleanly. A cluster is a
> > > > > > > > > > > process stretching over multiple nodes - if they all
> > > > > > > > > > > shut down, the process is gone. When you start up
> > > > > > > > > > > again, you effectively have a completely new cluster.
> > > > > > > > > > 
> > > > > > > > > > Sorry, I don't follow you at all here.  When you start
> > > > > > > > > > the cluster up again, the cluster config from before the
> > > > > > > > > > shutdown is still there.  That's very far from being a
> > > > > > > > > > completely new cluster :-)
> > > > > > > > > 
> > > > > > > > > The problem is you cannot "start the cluster" in
> > > > > > > > > pacemaker; you can only "start nodes". The nodes will come
> > > > > > > > > up one by one. As opposed (as I had said) to HP Service
> > > > > > > > > Guard, where there is a "cluster formation timeout". That
> > > > > > > > > is, the nodes wait for the specified time for the cluster
> > > > > > > > > to "form". Then the cluster starts as a whole. Of course
> > > > > > > > > that only applies if the whole cluster was down, not if a
> > > > > > > > > single node was down.
> > > > > > > > 
> > > > > > > > I'm not sure what that would specifically entail, but I'm
> > > > > > > > guessing we have some of the pieces already:
> > > > > > > > 
> > > > > > > > - Corosync has a wait_for_all option if you want the cluster
> > > > > > > > to be unable to have quorum at start-up until every node has
> > > > > > > > joined. I don't think you can set a timeout that cancels it,
> > > > > > > > though.
> > > > > > > > 
> > > > > > > > - Pacemaker will wait dc-deadtime for the first DC election
> > > > > > > > to complete. (if I understand it correctly ...)
> > > > > > > > 
> > > > > > > > - Higher-level tools can start or stop all nodes together
> > > > > > > > (e.g. pcs has pcs cluster start/stop --all).
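
[For reference, the first two of those pieces look roughly like this --
illustrative snippets only, so double-check the syntax against your
corosync/pacemaker versions:

    # corosync.conf, quorum section: the cluster cannot gain quorum at
    # start-up until all nodes have joined
    quorum {
        provider: corosync_votequorum
        wait_for_all: 1
    }

    # Pacemaker cluster property for how long to wait around the first
    # DC election; the 2min value is only an example
    # (pcs users can run: pcs property set dc-deadtime=2min)
    crm_attribute --type crm_config --name dc-deadtime --update 2min
]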
> > > > > > > 
> > > > > > > Based on this discussion, I have some questions about pcs:
> > > > > > > 
> > > > > > > * how is it shutting down the cluster when issuing
> > > > > > >   "pcs cluster stop --all"?
> > > > > > 
> > > > > > First, it sends a request to each node to stop pacemaker. The
> > > > > > requests are sent in parallel which prevents resources from
> > > > > > being moved from node to node. Once pacemaker stops on all
> > > > > > nodes, corosync is stopped on all nodes in the same manner.
> > > > > 
> > > > > What if, for some external reason (load, network, whatever), one
> > > > > node is slower than the others and reacts late? Sending the
> > > > > requests in parallel doesn't feel safe enough, given all the race
> > > > > conditions that can occur at the same time.
> > > > > 
> > > > > Am I missing something ?
> > > > > 
> > > > 
> > > > If a node gets the request later than others, some resources may
> > > > be moved to it before it starts shutting down pacemaker as well.
> > > > Pcs waits
> > > 
> > > I think that's impossible due to the ordering of corosync: If a
> > > standby is issued, and a resource migration is the consequence, every
> > > node will see the standby before it sees any other config change.
> > > Right?
> > 
> > pcs doesn't issue a standby, just a shutdown.
> > 
> > When a node needs to shut down, it sends a shutdown request to the DC,
> > which sets a "shutdown" node attribute, which tells the policy engine
> > to get all resources off the node.
> > 
> > Once all nodes have the "shutdown" node attribute set, there is nowhere
> > left for resources to run, so they will be stopped rather than
> > migrated. But if the resources are quicker than the attribute setting,
> > they can migrate before that happens.
> > 
> > pcs doesn't issue a standby for the reasons discussed elsewhere in the
> > thread.
> > 
> > To get a true atomic shutdown, we'd have to introduce a new crmd
> > request for "shutdown all" that would result in the "shutdown"
> > attribute being set for all nodes in one CIB modification.
> 
> Does it mean we could set the shutdown node attribute on all nodes by
> hand using cibadmin?
> 
> I suppose this would force the CRM to compute the shutdown of everything
> in only one transition, wouldn't it?

"shutdown" is a transient attribute, so it must be managed by attrd,
rather than the CIB directly.
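
For context, transient attributes live in the status section of the CIB,
which attrd rewrites and which is discarded when the cluster restarts,
so hand-editing it there wouldn't stick. The entry looks roughly like
this (node name and ids made up):

    <node_state id="1" uname="node1" ...>
      <transient_attributes id="1">
        <instance_attributes id="status-1">
          <nvpair id="status-1-shutdown" name="shutdown"
                  value="1512495071"/>
        </instance_attributes>
      </transient_attributes>
    </node_state>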

It should be theoretically possible to set a dampening on shutdown and
then set the attribute for all nodes within the dampening window, so
attrd writes it out in one go. (The value of "shutdown" must be the
Unix epoch timestamp at the time of request.)
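
Sketched out, that would be something like the following -- purely
illustrative and from memory, with made-up node names, and (per the
caveats below) not something I'd recommend relying on:

    # hypothetical: give "shutdown" a 5-second dampening delay and set
    # it for every node inside that window, so attrd flushes all the
    # values to the CIB in a single write
    now=$(date +%s)
    for node in node1 node2 node3; do
        attrd_updater -n shutdown -U "$now" -N "$node" -d 5
    done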

However, I see two problems: delaying shutdown (which dampening would
do) sounds like it might have unintended consequences if shutdown is
requested by some other means; and more importantly, "shutdown" is an
internal attribute meant to be managed only by the DC as part of a
shutdown sequence, so it wouldn't be guaranteed to work across future
versions.
-- 
Ken Gaillot <kgaillot at redhat.com>



