[ClusterLabs] questions about startup fencing
Ken Gaillot
kgaillot at redhat.com
Fri Dec 1 23:21:27 CET 2017
On Thu, 2017-11-30 at 11:58 +0000, Adam Spiers wrote:
> Ken Gaillot <kgaillot at redhat.com> wrote:
> > On Wed, 2017-11-29 at 14:22 +0000, Adam Spiers wrote:
> > > Hi all,
> > >
> > > A colleague has been valiantly trying to help me belatedly learn
> > > about the intricacies of startup fencing, but I'm still not
> > > fully understanding some of the finer points of the behaviour.
> > >
> > > The documentation on the "startup-fencing" option[0] says
> > >
> > > Advanced Use Only: Should the cluster shoot unseen nodes? Not
> > > using the default is very unsafe!
> > >
> > > and that it defaults to TRUE, but doesn't elaborate any further:
> > >
> > > https://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-cluster-options.html
> > >
> > > Let's imagine the following scenario:
> > >
> > > - We have a 5-node cluster, with all nodes running cleanly.
> > >
> > > - The whole cluster is shut down cleanly.
> > >
> > > - The whole cluster is then started up again. (Side question:
> > > what happens if the last node to shut down is not the first to
> > > start up? How will the cluster ensure it has the most recent
> > > version of the CIB? Without that, how would it know whether the
> > > last man standing was shut down cleanly or not?)
> >
> > Of course, the cluster can't know what CIB version nodes it
> > doesn't see have, so if a set of nodes is started with an older
> > version, it will go with that.
>
> Right, that's what I expected.
>
> > However, a node can't do much without quorum, so it would be
> > difficult to get in a situation where CIB changes were made with
> > quorum before shutdown, but none of those nodes are present at the
> > next start-up with quorum.
> >
> > In any case, when a new node joins a cluster, the nodes do compare
> > CIB versions. If the new node has a newer CIB, the cluster will
> > use it. If other changes have been made since then, the newest CIB
> > wins, so one or the other's changes will be lost.
>
> Ahh, that's interesting. Based on reading
>
> https://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/ch03.html#_cib_properties
>
> whichever node has the highest (admin_epoch, epoch, num_updates)
> tuple will win, so normally in this scenario it would be the epoch
> which decides it, i.e. whichever node had the most changes since the
> last time the conflicting nodes shared the same config - right?
Correct ... assuming the code for that is working properly, which I
haven't confirmed :)
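For reference, all three version fields live as attributes on the root
<cib> element, so you can compare two nodes' tuples at a glance
(values here are illustrative):

    <cib admin_epoch="0" epoch="42" num_updates="7" ...>

admin_epoch is compared first, then epoch (bumped on configuration
changes), then num_updates (which is reset to 0 whenever epoch
changes).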
>
> And if that would choose the wrong node, admin_epoch can be set
> manually to override that decision?
Correct again, with same caveat
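If you ever need to force a particular copy to win, bumping the admin
epoch is one command (the value is arbitrary, it just needs to be
higher than the other nodes'):

    cibadmin --modify --xml-text '<cib admin_epoch="42"/>'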
>
> > Whether missing nodes were shut down cleanly or not relates to your
> > next question ...
> >
> > > - 4 of the nodes boot up fine and rejoin the cluster within the
> > > dc-deadtime interval, forming a quorum, but the 5th doesn't.
> > >
> > > IIUC, with startup-fencing enabled, this will result in that 5th
> > > node automatically being fenced. If I'm right, is that really
> > > *always* necessary?
> >
> > It's always safe. :-) As you mentioned, if the missing node was
> > the last one alive in the previous run, the cluster can't know
> > whether it shut down cleanly or not. Even if the node was known to
> > shut down cleanly in the last run, the cluster still can't know
> > whether the node was started since then and is now merely
> > unreachable. So, fencing is necessary to ensure it's not accessing
> > resources.
>
> I get that, but I was questioning the "necessary to ensure it's not
> accessing resources" part of this statement. My point is that
> sometimes this might be overkill, because sometimes we might be able
> to discern through other methods that there are no resources we need
> to worry about potentially conflicting with what we want to run.
> That's why I gave the stateless clones example.
>
> > The same scenario is why a single node can't have quorum at
> > start-up in a cluster with "two_node" set. Both nodes have to see
> > each other at least once before they can assume it's safe to do
> > anything.
>
> Yep.
>
> > > Let's suppose further that the cluster configuration is such
> > > that no stateful resources which could potentially conflict with
> > > other nodes will ever get launched on that 5th node. For example
> > > it might only host stateless clones, or resources with
> > > requires=nothing set, or it might not even host any resources at
> > > all due to some temporary constraints which have been applied.
> > >
> > > In those cases, what is to be gained from fencing? The only
> > > thing I can think of is that using (say) IPMI to power-cycle the
> > > node *might* fix whatever issue was preventing it from joining
> > > the cluster. Are there any other reasons for fencing in this
> > > case? It wouldn't help avoid any data corruption, at least.
> >
> > Just because constraints are telling the node it can't run a
> > resource doesn't mean the node isn't malfunctioning and running it
> > anyway. If the node can't tell us it's OK, we have to assume it's
> > not.
>
> Sure, but even if it *is* running it, if it's not conflicting with
> anything or doing any harm, is it really always better to fence
> regardless?
There's a resource meta-attribute "requires" that says what a resource
needs in order to start. If a resource can't do any harm even when it
runs awry, you can set requires="quorum" (or even "nothing").
So, that's sort of a way to let the cluster know that, but it doesn't
currently do what you're suggesting, since start-up fencing is purely
about the node and not about the resources. I suppose if the cluster
had no resources requiring fencing (or, to push it further, no such
resources that will be probed on that node), we could disable start-up
fencing, but that's not done currently.
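For the archives, setting that on a resource is a one-liner (the
resource name here is made up):

    crm_resource --resource my-stateless-clone --meta \
        --set-parameter requires --parameter-value quorum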
> Disclaimer: to a certain extent I'm playing devil's advocate here to
> stimulate a closer (re-)examination of the axiom we've grown so used
> to over the years that if we don't know what a node is doing, we
> should fence it. I'm not necessarily arguing that fencing is wrong
> here, but I think it's healthy to occasionally go back to first
> principles and re-question why we are doing things a certain way, to
> make sure that the original assumptions still hold true. I'm
> familiar with the pain that our customers experience when nodes are
> fenced for less than very compelling reasons, so I think it's worth
> looking for opportunities to reduce fencing to when it's really
> needed.
The fundamental purpose of a high-availability cluster is to keep the
desired service functioning, above all other priorities (including,
unfortunately, making sysadmins' lives easier).
If a service requires an HA cluster, it's a safe bet it will have
problems in a split-brain situation (otherwise, why bother with the
overhead). Even something as simple as an IP address will render a
service useless if it's brought up on two machines on a network.
Fencing is really the only hammer we have in that situation. At that
point, we have zero information about what the node is doing. If it's
powered off (or cut off from disk/network), we know it's not doing
anything.
Fencing may not always help the situation, but it's all we've got.
We give the user a good bit of control over fencing policies: corosync
tuning, stonith-enabled, startup-fencing, no-quorum-policy, requires,
on-fail, and the choice of fence agent. It can be a challenge for a new
user to know all the knobs to turn, but HA is kind of unavoidably
complex.
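Most of those knobs are ordinary cluster properties, so they are all
settable the same way, e.g.:

    crm_attribute --type crm_config --name no-quorum-policy --update stop
    crm_attribute --type crm_config --name startup-fencing --update true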
> > > Now let's imagine the same scenario, except rather than a clean
> > > full cluster shutdown, all nodes were affected by a power cut,
> > > but also this time the whole cluster is configured to *only* run
> > > stateless clones, so there is no risk of conflict between two
> > > nodes accidentally running the same resource. On startup, the 4
> > > nodes in the quorum have no way of knowing that the 5th node was
> > > also affected by the power cut, so in theory from their
> > > perspective it could still be running a stateless clone. Again,
> > > is there anything to be gained from fencing the 5th node once it
> > > exceeds the dc-deadtime threshold for joining, other than the
> > > chance that a reboot might fix whatever was preventing it from
> > > joining, and get the cluster back to full strength?
> >
> > If a cluster runs only services that have no potential to conflict,
> > then you don't need a cluster. :-)
>
> True :-) Again as devil's advocate this scenario could be extended
> to include remote nodes which *do* run resources which could
> conflict (such as compute nodes), and in that case running stateless
> clones (only) on the core cluster could be justified simply on the
> grounds that we need Pacemaker for the remotes anyway, so we might
> as well use it for the stateless clones rather than introducing
> keepalived as yet another component ... but this is starting to get
> hypothetical, so it's perhaps not worth spending energy discussing
> that tangent ;-)
>
> > Unique clones require communication even if they're stateless
> > (think IPaddr2).
>
> Well yeah, IPaddr2 is arguably stateful since there are changing ARP
> tables involved :-)
>
> > I'm pretty sure even some anonymous stateless clones require
> > communication to avoid issues.
>
> Fair enough.
>
> > > Also, when exactly does the dc-deadtime timer start ticking?
> > > Is it reset to zero after a node is fenced, so that potentially
> > > that node could go into a reboot loop if dc-deadtime is set too
> > > low?
> >
> > A node's crmd starts the timer at start-up and whenever a new
> > election starts; the timer is stopped when the DC makes the node a
> > join offer.
>
> That's surprising - I would have expected it to be the other way
> around, i.e. that the timer doesn't run on the node which is joining,
> but one of the nodes already in the cluster (e.g. the DC). Otherwise
> how can fencing of that node be triggered if the node takes too long
> to join?
>
> > I don't think it ever reboots though, I think it just starts a new
> > election.
>
> Maybe we're talking at cross-purposes? By "reboot loop", I was
> asking if the node which fails to join could end up getting
> endlessly fenced: join timeout -> fenced -> reboots -> join
> timeout -> fenced -> ... etc.
startup-fencing and dc-deadtime don't have anything to do with each
other.
There are two separate joins: the node joins at the corosync layer, and
then its crmd joins to the other crmd's at the pacemaker layer. One of
the crmd's is then elected DC.
startup-fencing kicks in if the cluster has quorum and the DC sees no
node status in the CIB for a node. Node status will be recorded in the
CIB once it joins at the corosync layer. So, a node has until quorum
is reached, a DC is elected, and the DC invokes the policy engine to
join at the cluster layer; if it hasn't joined by then, it will be
shot. (And at that time, its status is known and recorded as dead.)
This only happens when the cluster first starts, and is the only way
to handle split-brain at start-up.
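(If you're curious what the DC knows, look at the status section of
the CIB: a node that has joined gets a node_state entry roughly like

    <node_state id="2" uname="node2" in_ccm="true" crmd="online"
                join="member" expected="member"/>

with illustrative values here, and it's the complete absence of such
an entry that triggers startup-fencing.)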
dc-deadtime is for the DC election. When a node joins an existing
cluster, it expects the existing DC to make it a membership offer (at
the pacemaker layer). If that doesn't happen within dc-deadtime, the
node asks for a new DC election. The idea is that the DC may be having
trouble that hasn't been detected yet. Similarly, whenever a new
election is called, all of the nodes expect a join offer from whichever
node is elected DC, and again they call a new election if that doesn't
happen in dc-deadtime.
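dc-deadtime is an ordinary cluster property too, so if the default is
too tight for your network, something like this will raise it:

    crm_attribute --type crm_config --name dc-deadtime --update 60s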
> > So, you can get into an election loop, but I think network
> > conditions would have to be pretty severe.
>
> Yeah, that sounds like a different type of loop to the one I was
> imagining.
>
> > > The same questions apply if this troublesome node was actually a
> > > remote node running pacemaker_remoted, rather than the 5th node
> > > in the cluster.
> >
> > Remote nodes don't join at the crmd level as cluster nodes do, so
> > they don't "start up" in the same sense
>
> Sure, they establish a TCP connection via pacemaker_remoted when the
> remote resource is starting.
>
> > and start-up fencing doesn't apply to them. Instead, the cluster
> > initiates the connection when called for (I don't remember for sure
> > whether it fences the remote node if the connection fails, but that
> > would make sense).
>
> Hrm, that's not what Yan said, and it's not what my L3 colleagues are
> reporting either ;-) I've been told (but not yet verified myself)
> that if a remote resource's start operation times out (e.g. due to
> the remote node not being up yet), the remote will get fenced.
> But I see Yan has already replied with additional details on this.
Yep I remembered wrong :)
> > > I have an uncomfortable feeling that I'm missing something
> > > obvious, probably due to the documentation's warning that "Not
> > > using the default [for startup-fencing] is very unsafe!" Or is
> > > it only unsafe when the node which exceeded dc-deadtime on
> > > startup could potentially be running a stateful resource which
> > > the cluster now wants to restart elsewhere? If that's the case,
> > > would it be possible to optionally limit startup fencing to when
> > > it's really needed?
> > >
> > > Thanks for any light you can shed!
> >
> > There's no automatic mechanism to know that, but if you know
> > before a particular start that certain nodes are really down and
> > are staying that way, you can disable start-up fencing in the
> > configuration on disk, before starting the other nodes, then
> > re-enable it once everything is back to normal.
>
> Ahah! That's the kind of tip I was looking for, thanks :-) So you
> mean by editing the CIB XML directly? Would disabling
> startup-fencing manually this way require a concurrent update of
> the epoch?
You can edit the CIB on disk when the cluster is down, but you have to
go about it carefully.
Rather than edit it directly, you can use
CIB_file=/var/lib/pacemaker/cib/cib.xml when invoking cibadmin (or your
favorite higher-level tool). cibadmin will update the hash that
pacemaker uses to verify the CIB's integrity. Alternatively, you can
remove *everything* in /var/lib/pacemaker/cib except cib.xml, then edit
it directly.
Updating the admin epoch is a good idea if you want to be sure your
edited CIB wins, although starting that node first is also good enough.
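Putting that together, a rough sketch of the whole recipe (with the
cluster stopped, run on the node you plan to start first; the
admin_epoch value is arbitrary, and the bump is optional if you start
this node first anyway):

    export CIB_file=/var/lib/pacemaker/cib/cib.xml
    crm_attribute --type crm_config --name startup-fencing --update false
    cibadmin --modify --xml-text '<cib admin_epoch="42"/>'
    unset CIB_file

Then start the cluster, and flip startup-fencing back to true once
everything has rejoined.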
--
Ken Gaillot <kgaillot at redhat.com>