[ClusterLabs] questions about startup fencing

Wed Nov 29 10:54:49 EST 2017

On Wed, 2017-11-29 at 14:22 +0000, Adam Spiers wrote:
> Hi all,
> 
> A colleague has been valiantly trying to help me belatedly learn about
> the intricacies of startup fencing, but I'm still not fully
> understanding some of the finer points of the behaviour.
> 
> The documentation on the "startup-fencing" option[0] says
> 
>     Advanced Use Only: Should the cluster shoot unseen nodes? Not
>     using the default is very unsafe!
> 
> and that it defaults to TRUE, but doesn't elaborate any further:
> 
>     https://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-cluster-options.html
> 
> Let's imagine the following scenario:
> 
> - We have a 5-node cluster, with all nodes running cleanly.
> 
> - The whole cluster is shut down cleanly.
> 
> - The whole cluster is then started up again.  (Side question: what
>   happens if the last node to shut down is not the first to start up?
>   How will the cluster ensure it has the most recent version of the
>   CIB?  Without that, how would it know whether the last man standing
>   was shut down cleanly or not?)

Of course, the cluster can't know which CIB version the nodes it doesn't
see have, so if a set of nodes is started with an older version, it will
go with that.

However, a node can't do much without quorum, so it would be difficult
to get in a situation where CIB changes were made with quorum before
shutdown, but none of those nodes are present at the next start-up with
quorum.

In any case, when a new node joins a cluster, the nodes do compare CIB
versions. If the new node has a newer CIB, the cluster will use it. If
other changes have been made since then, the newest CIB wins, so one or
the other's changes will be lost.
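
To illustrate the comparison: a CIB's version is carried in the
admin_epoch, epoch and num_updates attributes of the cib element,
compared in that order. Here is a minimal Python sketch of that rule.
The CibVersion type and the newest() helper are hypothetical names made
up for illustration; this is not Pacemaker's own code.

```python
from typing import NamedTuple


class CibVersion(NamedTuple):
    """Version triple of a CIB (hypothetical type, for illustration only)."""
    admin_epoch: int
    epoch: int
    num_updates: int


def newest(a: CibVersion, b: CibVersion) -> CibVersion:
    """Return the newer of two CIB versions.

    Tuples compare field by field, left to right, so admin_epoch
    outranks epoch, which in turn outranks num_updates.
    """
    return max(a, b)


# Example values are invented: a joining node's CIB vs. the cluster's.
cluster_cib = CibVersion(admin_epoch=0, epoch=11, num_updates=40)
joining_cib = CibVersion(admin_epoch=0, epoch=12, num_updates=3)
winner = newest(cluster_cib, joining_cib)
```

Here the joining node's CIB wins because its epoch is higher, even
though it has fewer num_updates. Any changes that exist only in the
losing CIB are discarded.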

Whether missing nodes were shut down cleanly or not relates to your
next question ...

> - 4 of the nodes boot up fine and rejoin the cluster within the
>   dc-deadtime interval, forming a quorum, but the 5th doesn't.
> 
> IIUC, with startup-fencing enabled, this will result in that 5th node
> automatically being fenced.  If I'm right, is that really *always*
> necessary?

It's always safe. :-) As you mentioned, if the missing node was the
last one alive in the previous run, the cluster can't know whether it
shut down cleanly or not. Even if the node was known to shut down
cleanly in the last run, the cluster still can't know whether the node
was started since then and is now merely unreachable. So, fencing is
necessary to ensure it's not accessing resources.

The same scenario is why a single node can't have quorum at start-up in
a cluster with "two_node" set. Both nodes have to see each other at
least once before they can assume it's safe to do anything.
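
As a rough illustration of that rule, here is a deliberately simplified
Python model of the quorum decision. It assumes plain majority voting
with one vote per node, and that "two_node" lets a single node keep
quorum only after both nodes have seen each other at least once. The
has_quorum() function and its parameters are hypothetical; this is not
corosync's actual implementation.

```python
def has_quorum(expected_votes: int,
               visible_votes: int,
               two_node: bool = False,
               seen_all_once: bool = True) -> bool:
    """Simplified quorum check (illustrative model only).

    expected_votes: total number of votes in the cluster
    visible_votes:  votes of the nodes currently seen, including this one
    two_node:       model of the "two_node" special case
    seen_all_once:  whether all nodes have been seen together at least once
    """
    if two_node:
        # In this model, one surviving node is enough, but only after
        # both nodes have been seen together once since start-up.
        return seen_all_once and visible_votes >= 1
    # Ordinary case: a strict majority of the expected votes is required.
    return visible_votes >= expected_votes // 2 + 1


five_node_with_four_up = has_quorum(expected_votes=5, visible_votes=4)
lone_node_at_startup = has_quorum(2, 1, two_node=True, seen_all_once=False)
lone_node_after_peer_lost = has_quorum(2, 1, two_node=True, seen_all_once=True)
```

In this model the 4 nodes of the 5-node example have quorum, a lone
node in a two-node cluster does not have it at start-up, and the same
node keeps it if its peer disappears after both were seen.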

> Let's suppose further that the cluster configuration is such that no
> stateful resources which could potentially conflict with other nodes
> will ever get launched on that 5th node.  For example it might only
> host stateless clones, or resources with require=nothing set, or it
> might not even host any resources at all due to some temporary
> constraints which have been applied.
> 
> In those cases, what is to be gained from fencing?  The only thing I
> can think of is that using (say) IPMI to power-cycle the node *might*
> fix whatever issue was preventing it from joining the cluster.  Are
> there any other reasons for fencing in this case?  It wouldn't help
> avoid any data corruption, at least.

Just because constraints are telling the node it can't run a resource
doesn't mean the node isn't malfunctioning and running it anyway. If
the node can't tell us it's OK, we have to assume it's not.

> Now let's imagine the same scenario, except rather than a clean full
> cluster shutdown, all nodes were affected by a power cut, but also
> this time the whole cluster is configured to *only* run stateless
> clones, so there is no risk of conflict between two nodes accidentally
> running the same resource.  On startup, the 4 nodes in the quorum have
> no way of knowing that the 5th node was also affected by the power
> cut, so in theory from their perspective it could still be running a
> stateless clone.  Again, is there anything to be gained from fencing
> the 5th node once it exceeds the dc-deadtime threshold for joining,
> other than the chance that a reboot might fix whatever was preventing
> it from joining, and get the cluster back to full strength?

If a cluster runs only services that have no potential to conflict,
then you don't need a cluster. :-)

Unique clones require communication even if they're stateless (think
IPaddr2). I'm pretty sure even some anonymous stateless clones require
communication to avoid issues.

> Also, when exactly does the dc-deadtime timer start ticking?
> Is it reset to zero after a node is fenced, so that potentially that
> node could go into a reboot loop if dc-deadtime is set too low?

A node's crmd starts the timer at start-up and whenever a new election
starts, and stops it when the DC makes it a join offer. I don't think
it ever reboots, though; I think it just starts a new election. So, you
can get into an election loop, but I think network conditions would
have to be pretty severe.

See also: https://bugs.clusterlabs.org/show_bug.cgi?id=5310

> The same questions apply if this troublesome node was actually a
> remote node running pacemaker_remoted, rather than the 5th node in the
> cluster.

Remote nodes don't join at the crmd level as cluster nodes do, so they
don't "start up" in the same sense, and start-up fencing doesn't apply
to them. Instead, the cluster initiates the connection when called for
(I don't remember for sure whether it fences the remote node if the
connection fails, but that would make sense).

> I have an uncomfortable feeling that I'm missing something obvious,
> probably due to the documentation's warning that "Not using the
> default [for startup-fencing] is very unsafe!"  Or is it only unsafe
> when the resource which exceeded dc-deadtime on startup could
> potentially be running a stateful resource which the cluster now wants
> to restart elsewhere?  If that's the case, would it be possible to
> optionally limit startup fencing to when it's really needed?
> 
> Thanks for any light you can shed!

There's no automatic mechanism to know that, but if you know before a
particular start that certain nodes are really down and are staying
that way, you can disable start-up fencing in the configuration on
disk, before starting the other nodes, then re-enable it once
everything is back to normal.
-- 
Ken Gaillot <kgaillot at redhat.com>