[ClusterLabs] questions about startup fencing

Fri Dec 1 22:29:19 UTC 2017

On Fri, 2017-12-01 at 16:21 -0600, Ken Gaillot wrote:
> On Thu, 2017-11-30 at 11:58 +0000, Adam Spiers wrote:
> > Ken Gaillot <kgaillot at redhat.com> wrote:
> > > On Wed, 2017-11-29 at 14:22 +0000, Adam Spiers wrote:
> > > > Hi all,
> > > > 
> > > > A colleague has been valiantly trying to help me belatedly learn
> > > > about the intricacies of startup fencing, but I'm still not fully
> > > > understanding some of the finer points of the behaviour.
> > > > 
> > > > The documentation on the "startup-fencing" option[0] says
> > > > 
> > > >     Advanced Use Only: Should the cluster shoot unseen nodes? Not
> > > >     using the default is very unsafe!
> > > > 
> > > > and that it defaults to TRUE, but doesn't elaborate any further:
> > > > 
> > > >     https://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-cluster-options.html
> > > > 
> > > > Let's imagine the following scenario:
> > > > 
> > > > - We have a 5-node cluster, with all nodes running cleanly.
> > > > 
> > > > - The whole cluster is shut down cleanly.
> > > > 
> > > > - The whole cluster is then started up again.  (Side question: what
> > > >   happens if the last node to shut down is not the first to start up?
> > > >   How will the cluster ensure it has the most recent version of the
> > > >   CIB?  Without that, how would it know whether the last man standing
> > > >   was shut down cleanly or not?)
> > > 
> > > Of course, the cluster can't know which CIB version the nodes it
> > > doesn't see have, so if a set of nodes is started with an older
> > > version, it will go with that.
> > 
> > Right, that's what I expected.
> > 
> > > However, a node can't do much without quorum, so it would be difficult
> > > to get into a situation where CIB changes were made with quorum before
> > > shutdown, but none of those nodes are present at the next start-up
> > > with quorum.
> > > 
> > > In any case, when a new node joins a cluster, the nodes do compare CIB
> > > versions. If the new node has a newer CIB, the cluster will use it. If
> > > other changes have been made since then, the newest CIB wins, so one or
> > > the other's changes will be lost.
> > 
> > Ahh, that's interesting.  Based on reading
> > 
> >     https://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/ch03.html#_cib_properties
> > 
> > whichever node has the highest (admin_epoch, epoch, num_updates) tuple
> > will win, so normally in this scenario it would be the epoch which
> > decides it, i.e. whichever node had the most changes since the last
> > time the conflicting nodes shared the same config - right?
> 
> Correct ... assuming the code for that is working properly, which I
> haven't confirmed :)
> 
> > 
> > And if that would choose the wrong node, admin_epoch can be set
> > manually to override that decision?
> 
> Correct again, with same caveat
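
For illustration, the documented ordering can be sketched as below. This
is only a minimal model of the (admin_epoch, epoch, num_updates)
comparison described in Pacemaker Explained, not Pacemaker's actual
code; the CibVersion type, the newest() helper, and the version numbers
are all made up for the example.

```python
from typing import NamedTuple


class CibVersion(NamedTuple):
    """Version fields of a CIB, most significant first (hypothetical type)."""
    admin_epoch: int
    epoch: int
    num_updates: int


def newest(a: CibVersion, b: CibVersion) -> CibVersion:
    # Named tuples compare field by field, left to right, which matches
    # the documented ordering: admin_epoch, then epoch, then num_updates.
    return max(a, b)


# Made-up values: node2 has seen more configuration changes (higher epoch),
# so its CIB wins even though node1 has more status updates.
node1 = CibVersion(admin_epoch=0, epoch=41, num_updates=7)
node2 = CibVersion(admin_epoch=0, epoch=45, num_updates=0)
assert newest(node1, node2) == node2

# Bumping admin_epoch on node1 overrides the epoch comparison.
node1_bumped = node1._replace(admin_epoch=1)
assert newest(node1_bumped, node2) == node1_bumped
```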
> 
> > 
> > > Whether missing nodes were shut down cleanly or not relates to your
> > > next question ...
> > > 
> > > > - 4 of the nodes boot up fine and rejoin the cluster within the
> > > >   dc-deadtime interval, forming a quorum, but the 5th doesn't.
> > > > 
> > > > IIUC, with startup-fencing enabled, this will result in that 5th node
> > > > automatically being fenced.  If I'm right, is that really *always*
> > > > necessary?
> > > 
> > > It's always safe. :-) As you mentioned, if the missing node was the
> > > last one alive in the previous run, the cluster can't know whether it
> > > shut down cleanly or not. Even if the node was known to shut down
> > > cleanly in the last run, the cluster still can't know whether the node
> > > was started since then and is now merely unreachable. So, fencing is
> > > necessary to ensure it's not accessing resources.
> > 
> > I get that, but I was questioning the "necessary to ensure it's not
> > accessing resources" part of this statement.  My point is that
> > sometimes this might be overkill, because sometimes we might be able to
> > discern through other methods that there are no resources we need to
> > worry about potentially conflicting with what we want to run.  That's
> > why I gave the stateless clones example.
> > 
> > > The same scenario is why a single node can't have quorum at start-up in
> > > a cluster with "two_node" set. Both nodes have to see each other at
> > > least once before they can assume it's safe to do anything.
> > 
> > Yep.
> > 
> > > > Let's suppose further that the cluster configuration is such that no
> > > > stateful resources which could potentially conflict with other nodes
> > > > will ever get launched on that 5th node.  For example it might only
> > > > host stateless clones, or resources with requires=nothing set, or it
> > > > might not even host any resources at all due to some temporary
> > > > constraints which have been applied.
> > > > 
> > > > In those cases, what is to be gained from fencing?  The only thing I
> > > > can think of is that using (say) IPMI to power-cycle the node *might*
> > > > fix whatever issue was preventing it from joining the cluster.  Are
> > > > there any other reasons for fencing in this case?  It wouldn't help
> > > > avoid any data corruption, at least.
> > > 
> > > Just because constraints are telling the node it can't run a resource
> > > doesn't mean the node isn't malfunctioning and running it anyway. If
> > > the node can't tell us it's OK, we have to assume it's not.
> > 
> > Sure, but even if it *is* running it, if it's not conflicting with
> > anything or doing any harm, is it really always better to fence
> > regardless?
> 
> There's a resource meta-attribute "requires" that says what a resource
> needs to start. If it can't do any harm if it runs awry, you can set
> requires="quorum" (or even "nothing").
> 
> So, that's sort of a way to let the cluster know that, but it doesn't
> currently do what you're suggesting, since start-up fencing is purely
> about the node and not about the resources. I suppose if the cluster
> had no resources requiring fencing (or, to push it further, no such
> resources that will be probed on that node), we could disable start-up
> fencing, but that's not done currently.
> 
> > Disclaimer: to a certain extent I'm playing devil's advocate here to
> > stimulate a closer (re-)examination of the axiom we've grown so used
> > to over the years that if we don't know what a node is doing, we
> > should fence it.  I'm not necessarily arguing that fencing is wrong
> > here, but I think it's healthy to occasionally go back to first
> > principles and re-question why we are doing things a certain way, to
> > make sure that the original assumptions still hold true.  I'm familiar
> > with the pain that our customers experience when nodes are fenced for
> > less than very compelling reasons, so I think it's worth looking for
> > opportunities to reduce fencing to when it's really needed.
> 
> The fundamental purpose of a high-availability cluster is to keep the
> desired service functioning, above all other priorities (including,
> unfortunately, making sysadmins' lives easier).
> 
> If a service requires an HA cluster, it's a safe bet it will have
> problems in a split-brain situation (otherwise, why bother with the
> overhead). Even something as simple as an IP address will render a
> service useless if it's brought up on two machines on a network.
> 
> Fencing is really the only hammer we have in that situation. At that
> point, we have zero information about what the node is doing. If it's
> powered off (or cut off from disk/network), we know it's not doing
> anything.
> 
> Fencing may not always help the situation, but it's all we've got.
> 
> We give the user a good bit of control over fencing policies: corosync
> tuning, stonith-enabled, startup-fencing, no-quorum-policy, requires,
> on-fail, and the choice of fence agent. It can be a challenge for a new
> user to know all the knobs to turn, but HA is kind of unavoidably
> complex.
> 
> > > > Now let's imagine the same scenario, except rather than a clean full
> > > > cluster shutdown, all nodes were affected by a power cut, but also
> > > > this time the whole cluster is configured to *only* run stateless
> > > > clones, so there is no risk of conflict between two nodes accidentally
> > > > running the same resource.  On startup, the 4 nodes in the quorum have
> > > > no way of knowing that the 5th node was also affected by the power
> > > > cut, so in theory from their perspective it could still be running a
> > > > stateless clone.  Again, is there anything to be gained from fencing
> > > > the 5th node once it exceeds the dc-deadtime threshold for joining,
> > > > other than the chance that a reboot might fix whatever was preventing
> > > > it from joining, and get the cluster back to full strength?
> > > 
> > > If a cluster runs only services that have no potential to conflict,
> > > then you don't need a cluster. :-)
> > 
> > True :-)  Again as devil's advocate this scenario could be extended to
> > include remote nodes which *do* run resources which could conflict
> > (such as compute nodes), and in that case running stateless clones
> > (only) on the core cluster could be justified simply on the grounds
> > that we need Pacemaker for the remotes anyway, so we might as well use
> > it for the stateless clones rather than introducing keepalived as yet
> > another component ... but this is starting to get hypothetical, so
> > it's perhaps not worth spending energy discussing that tangent ;-)
> > 
> > > Unique clones require communication even if they're stateless (think
> > > IPaddr2).
> > 
> > Well yeah, IPaddr2 is arguably stateful since there are changing ARP
> > tables involved :-)
> > 
> > > I'm pretty sure even some anonymous stateless clones require
> > > communication to avoid issues.
> > 
> > Fair enough.
> > 
> > > > Also, when exactly does the dc-deadtime timer start ticking?
> > > > Is it reset to zero after a node is fenced, so that potentially that
> > > > node could go into a reboot loop if dc-deadtime is set too low?
> > > 
> > > A node's crmd starts the timer at start-up and whenever a new election
> > > starts, and stops it when the DC makes it a join offer.
> > 
> > That's surprising - I would have expected it to be the other way
> > around, i.e. that the timer doesn't run on the node which is joining,
> > but on one of the nodes already in the cluster (e.g. the DC).  Otherwise
> > how can fencing of that node be triggered if the node takes too long
> > to join?
> > 
> > > I don't think it ever reboots though, I think it just starts a new
> > > election.
> > 
> > Maybe we're talking at cross-purposes?  By "reboot loop", I was asking
> > if the node which fails to join could end up getting endlessly fenced:
> > join timeout -> fenced -> reboots -> join timeout -> fenced -> ...
> > etc.
> 
> startup-fencing and dc-deadtime don't have anything to do with each
> other.

I guess that's not quite accurate -- the first DC election at start-up
won't complete until dc-deadtime, so the DC won't be able to check for
start-up fencing until after then.

But a fence loop is not possible because once fencing is done, the node
has a known status. startup-fencing doesn't require that a node be
functional, only that its status is known.
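
As a rough illustration of why no loop results, here is a toy model of
that rule. It is a sketch under stated assumptions and not Pacemaker's
actual logic: the nodes_to_fence() helper, the status dictionary, and
the node names are all hypothetical.

```python
def nodes_to_fence(configured, known_status, have_quorum, startup_fencing=True):
    """Return the nodes a DC would shoot at start-up in this toy model."""
    if not (have_quorum and startup_fencing):
        return set()
    # Only nodes with no recorded status at all are candidates; a node
    # recorded as dead is "known" and is left alone.
    return {node for node in configured if node not in known_status}


configured = {"node1", "node2", "node3", "node4", "node5"}
# Four nodes joined; nothing has ever been recorded for node5.
status = {n: "member" for n in ("node1", "node2", "node3", "node4")}

first_pass = nodes_to_fence(configured, status, have_quorum=True)

# Successful fencing records the node's status as dead ...
for node in first_pass:
    status[node] = "dead"

# ... so a later check finds nothing left to shoot, whether or not
# node5 ever comes back up.
second_pass = nodes_to_fence(configured, status, have_quorum=True)
```

In this model first_pass contains only node5 and second_pass is empty,
which is the "known status" point above.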

> There are two separate joins: the node joins at the corosync layer, and
> then its crmd joins to the other crmd's at the pacemaker layer. One of
> the crmd's is then elected DC.
> 
> startup-fencing kicks in if the cluster has quorum and the DC sees no
> node status in the CIB for a node. Node status will be recorded in the
> CIB once it joins at the corosync layer. So, all nodes have until
> quorum is reached, a DC is elected, and the DC invokes the policy
> engine, to join at the cluster layer, else they will be shot. (And at
> that time, their status is known and recorded as dead.) This only
> happens when the cluster first starts, and is the only way to handle
> split-brain at start-up.
> 
> dc-deadtime is for the DC election. When a node joins an existing
> cluster, it expects the existing DC to make it a membership offer (at
> the pacemaker layer). If that doesn't happen within dc-deadtime, the
> node asks for a new DC election. The idea is that the DC may be having
> trouble that hasn't been detected yet. Similarly, whenever a new
> election is called, all of the nodes expect a join offer from whichever
> node is elected DC, and again they call a new election if that doesn't
> happen in dc-deadtime.
> 
> > > So, you can get into an election loop, but I think network conditions
> > > would have to be pretty severe.
> > 
> > Yeah, that sounds like a different type of loop to the one I was
> > imagining.
> > 
> > > > The same questions apply if this troublesome node was actually a
> > > > remote node running pacemaker_remoted, rather than the 5th node in
> > > > the cluster.
> > > 
> > > Remote nodes don't join at the crmd level as cluster nodes do, so they
> > > don't "start up" in the same sense
> > 
> > Sure, they establish a TCP connection via pacemaker_remoted when the
> > remote resource is starting.
> > 
> > > and start-up fencing doesn't apply to them.  Instead, the cluster
> > > initiates the connection when called for (I don't remember for sure
> > > whether it fences the remote node if the connection fails, but that
> > > would make sense).
> > 
> > Hrm, that's not what Yan said, and it's not what my L3 colleagues are
> > reporting either ;-)  I've been told (but not yet verified myself)
> > that if a remote resource's start operation times out (e.g. due to
> > the remote node not being up yet), the remote will get fenced.
> > But I see Yan has already replied with additional details on this.
> 
> Yep I remembered wrong :)
> 
> > > > I have an uncomfortable feeling that I'm missing something obvious,
> > > > probably due to the documentation's warning that "Not using the
> > > > default [for startup-fencing] is very unsafe!"  Or is it only
> > > > unsafe when the resource which exceeded dc-deadtime on startup
> > > > could potentially be running a stateful resource which the cluster
> > > > now wants to restart elsewhere?  If that's the case, would it be
> > > > possible to optionally limit startup fencing to when it's really
> > > > needed?
> > > > 
> > > > Thanks for any light you can shed!
> > > 
> > > There's no automatic mechanism to know that, but if you know before a
> > > particular start that certain nodes are really down and are staying
> > > that way, you can disable start-up fencing in the configuration on
> > > disk, before starting the other nodes, then re-enable it once
> > > everything is back to normal.
> > 
> > Ahah!  That's the kind of tip I was looking for, thanks :-)  So you
> > mean by editing the CIB XML directly?  Would disabling startup-fencing
> > manually this way require a concurrent update of the epoch?
> 
> You can edit the CIB on disk when the cluster is down, but you have to
> go about it carefully.
> 
> Rather than edit it directly, you can use
> CIB_file=/var/lib/pacemaker/cib/cib.xml when invoking cibadmin (or your
> favorite higher-level tool). cibadmin will update the hash that
> pacemaker uses to verify the CIB's integrity. Alternatively, you can
> remove *everything* in /var/lib/pacemaker/cib except cib.xml, then edit
> it directly.
> 
> Updating the admin epoch is a good idea if you want to be sure your
> edited CIB wins, although starting that node first is also good enough.
-- 
Ken Gaillot <kgaillot at redhat.com>