[ClusterLabs] Antw: Re: Antw: Re: questions about startup fencing

Jehan-Guillaume de Rorthais jgdr at dalibo.com
Tue Dec 5 16:43:18 UTC 2017


On Tue, 05 Dec 2017 08:59:55 -0600
Ken Gaillot <kgaillot at redhat.com> wrote:

> On Tue, 2017-12-05 at 14:47 +0100, Ulrich Windl wrote:
> > > > > Tomas Jelinek <tojeline at redhat.com> wrote on 04.12.2017 at
> > > > > 16:50 in message
> > 
> > <3e60579c-0f4d-1c32-70fc-d207e0654fbf at redhat.com>:
> > > On 4.12.2017 at 14:21, Jehan-Guillaume de Rorthais wrote:
> > > > On Mon, 4 Dec 2017 12:31:06 +0100
> > > > Tomas Jelinek <tojeline at redhat.com> wrote:
> > > > 
> > > > > On 4.12.2017 at 10:36, Jehan-Guillaume de Rorthais wrote:
> > > > > > On Fri, 01 Dec 2017 16:34:08 -0600
> > > > > > Ken Gaillot <kgaillot at redhat.com> wrote:
> > > > > >    
> > > > > > > On Thu, 2017-11-30 at 07:55 +0100, Ulrich Windl wrote:
> > > > > > > > 
> > > > > > > >       
> > > > > > > > > Kristoffer Gronlund <kgronlund at suse.com> wrote:
> > > > > > > > > > Adam Spiers <aspiers at suse.com> writes:
> > > > > > > > > >       
> > > > > > > > > > > - The whole cluster is shut down cleanly.
> > > > > > > > > > > 
> > > > > > > > > > > - The whole cluster is then started up again.  (Side
> > > > > > > > > > > question: what happens if the last node to shut down
> > > > > > > > > > > is not the first to start up?  How will the cluster
> > > > > > > > > > > ensure it has the most recent version of the CIB?
> > > > > > > > > > > Without that, how would it know whether the last man
> > > > > > > > > > > standing was shut down cleanly or not?)
> > > > > > > > > > 
> > > > > > > > > > This is my opinion, I don't really know what the
> > > > > > > > > > "official" pacemaker stance is: There is no such thing
> > > > > > > > > > as shutting down a cluster cleanly. A cluster is a
> > > > > > > > > > process stretching over multiple nodes - if they all
> > > > > > > > > > shut down, the process is gone. When you start up
> > > > > > > > > > again, you effectively have a completely new cluster.
> > > > > > > > > 
> > > > > > > > > Sorry, I don't follow you at all here.  When you start
> > > > > > > > > the cluster up again, the cluster config from before the
> > > > > > > > > shutdown is still there.  That's very far from being a
> > > > > > > > > completely new cluster :-)
> > > > > > > > 
> > > > > > > > The problem is you cannot "start the cluster" in
> > > > > > > > pacemaker; you can only "start nodes". The nodes will come
> > > > > > > > up one by one. As opposed (as I had said) to HP Service
> > > > > > > > Guard, where there is a "cluster formation timeout". That
> > > > > > > > is, the nodes wait for the specified time for the cluster
> > > > > > > > to "form". Then the cluster starts as a whole. Of course
> > > > > > > > that only applies if the whole cluster was down, not if a
> > > > > > > > single node was down.
> > > > > > > 
> > > > > > > I'm not sure what that would specifically entail, but I'm
> > > > > > > guessing we have some of the pieces already:
> > > > > > > 
> > > > > > > - Corosync has a wait_for_all option if you want the cluster
> > > > > > > to be unable to have quorum at start-up until every node has
> > > > > > > joined. I don't think you can set a timeout that cancels it,
> > > > > > > though.
> > > > > > > 
> > > > > > > - Pacemaker will wait dc-deadtime for the first DC election
> > > > > > > to complete. (if I understand it correctly ...)
> > > > > > > 
> > > > > > > - Higher-level tools can start or stop all nodes together
> > > > > > > (e.g. pcs has pcs cluster start/stop --all).
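
(Side note from me: wait_for_all is a votequorum option set in the quorum
section of corosync.conf. A minimal illustration, where the values are just
an example, not a recommendation:

  quorum {
      provider: corosync_votequorum
      # after a full cluster stop, quorum is not regained until
      # every node has joined again
      wait_for_all: 1
  }
)
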
> > > > > > 
> > > > > > Based on this discussion, I have some questions about pcs:
> > > > > > 
> > > > > > * how does it shut down the cluster when issuing "pcs cluster
> > > > > > stop --all"?
> > > > > 
> > > > > First, it sends a request to each node to stop pacemaker. The
> > > > > requests are sent in parallel, which prevents resources from
> > > > > being moved from node to node. Once pacemaker stops on all
> > > > > nodes, corosync is stopped on all nodes in the same manner.
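
(Side note from me: if I understand Tomas correctly, this is roughly the
equivalent of doing the following by hand, except that pcs talks to its own
daemon on each node instead of using ssh; node1..node3 are placeholders:

  # stop pacemaker everywhere first, in parallel...
  for n in node1 node2 node3; do ssh "$n" 'systemctl stop pacemaker' & done; wait
  # ...then, once that is done everywhere, stop corosync the same way
  for n in node1 node2 node3; do ssh "$n" 'systemctl stop corosync' & done; wait
)
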
> > > > 
> > > > What if, for some external reason, one node is slower (load,
> > > > network, whatever) than the others to react? Sending the requests
> > > > in parallel doesn't feel safe enough with regard to all the race
> > > > conditions that can occur at the same time.
> > > > 
> > > > Am I missing something?
> > > 
> > > If a node gets the request later than others, some resources may be
> > > moved to it before it starts shutting down pacemaker as well. Pcs
> > > waits [...]
> > 
> > I think that's impossible due to the ordering of corosync: If a
> > standby is issued, and a resource migration is the consequence, every
> > node will see the standby before it sees any other config change.
> > Right?
> 
> pcs doesn't issue a standby, just a shutdown.
> 
> When a node needs to shut down, it sends a shutdown request to the DC,
> which sets a "shutdown" node attribute, which tells the policy engine
> to get all resources off the node.
> 
> Once all nodes have the "shutdown" node attribute set, there is nowhere
> left for resources to run, so they will be stopped rather than
> migrated. But if the resources are quicker than the attribute setting,
> they can migrate before that happens.
> 
> pcs doesn't issue a standby for the reasons discussed elsewhere in the
> thread.
> 
> To get a true atomic shutdown, we'd have to introduce a new crmd
> request for "shutdown all" that would result in the "shutdown"
> attribute being set for all nodes in one CIB modification.

Does it mean we could set the shutdown node attribute on all nodes by hand
using cibadmin?

I suppose this would force the CRM to compute the shutdown of everything in
a single transition, wouldn't it?
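
Something like the following, maybe? (Completely untested sketch; node names
are placeholders, and I am using crm_attribute with a "reboot" lifetime
rather than editing the status section directly with cibadmin, assuming both
end up writing the same transient attribute.)

  # as far as I understand, crmd sets the transient "shutdown" attribute to
  # an epoch timestamp; set it on every node before stopping anything, so
  # the policy engine hopefully computes a single stop-everything transition
  now=$(date +%s)
  for n in node1 node2 node3; do
      crm_attribute --node "$n" --name shutdown --update "$now" --lifetime reboot
  done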



