[ClusterLabs] questions about startup fencing

Thu Nov 30 06:58:28 EST 2017

Ken Gaillot <kgaillot at redhat.com> wrote:
>On Wed, 2017-11-29 at 14:22 +0000, Adam Spiers wrote:
>> Hi all,
>>
>> A colleague has been valiantly trying to help me belatedly learn about
>> the intricacies of startup fencing, but I'm still not fully
>> understanding some of the finer points of the behaviour.
>>
>> The documentation on the "startup-fencing" option[0] says
>>
>>     Advanced Use Only: Should the cluster shoot unseen nodes? Not
>>     using the default is very unsafe!
>>
>> and that it defaults to TRUE, but doesn't elaborate any further:
>>
>>     https://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-cluster-options.html
>>
>> Let's imagine the following scenario:
>>
>> - We have a 5-node cluster, with all nodes running cleanly.
>>
>> - The whole cluster is shut down cleanly.
>>
>> - The whole cluster is then started up again.  (Side question: what
>>   happens if the last node to shut down is not the first to start up?
>>   How will the cluster ensure it has the most recent version of the
>>   CIB?  Without that, how would it know whether the last man standing
>>   was shut down cleanly or not?)
>
>Of course, the cluster can't know what CIB version nodes it doesn't see
>have, so if a set of nodes is started with an older version, it will go
>with that.

Right, that's what I expected.

>However, a node can't do much without quorum, so it would be difficult
>to get in a situation where CIB changes were made with quorum before
>shutdown, but none of those nodes are present at the next start-up with
>quorum.
>
>In any case, when a new node joins a cluster, the nodes do compare CIB
>versions. If the new node has a newer CIB, the cluster will use it. If
>other changes have been made since then, the newest CIB wins, so one or
>the other's changes will be lost.

Ahh, that's interesting.  Based on reading

    https://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/ch03.html#_cib_properties

whichever node has the highest (admin_epoch, epoch, num_updates) tuple
will win, so normally in this scenario it would be the epoch which
decides it, i.e. whichever node had the most changes since the last
time the conflicting nodes shared the same config - right?

And if that would choose the wrong node, admin_epoch can be set
manually to override that decision?
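
To check my understanding of the ordering, here is a minimal Python
sketch of the comparison as I read it from the docs.  The node names,
version numbers and the newest_cib() helper are all made up for
illustration; this is not Pacemaker's actual code.

```python
from typing import Dict, NamedTuple


class CibVersion(NamedTuple):
    """CIB version fields, in the order they are compared."""
    admin_epoch: int
    epoch: int
    num_updates: int


def newest_cib(candidates: Dict[str, CibVersion]) -> str:
    """Return the (hypothetical) node name holding the highest version.

    Tuples compare field by field, so admin_epoch is decisive first,
    then epoch, then num_updates.
    """
    return max(candidates, key=lambda node: candidates[node])


# Two nodes whose configs diverged: node2 saw more config changes.
candidates = {
    "node1": CibVersion(admin_epoch=0, epoch=42, num_updates=3),
    "node2": CibVersion(admin_epoch=0, epoch=45, num_updates=0),
}
winner = newest_cib(candidates)

# Bumping admin_epoch by hand on node1 overrides the epoch comparison.
overridden = dict(candidates, node1=CibVersion(1, 42, 3))
winner_after_override = newest_cib(overridden)
```

With these sample values node2 wins on epoch alone, and node1 wins
once its admin_epoch has been raised.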

>Whether missing nodes were shut down cleanly or not relates to your
>next question ...
>
>> - 4 of the nodes boot up fine and rejoin the cluster within the
>>   dc-deadtime interval, forming a quorum, but the 5th doesn't.
>>
>> IIUC, with startup-fencing enabled, this will result in that 5th node
>> automatically being fenced.  If I'm right, is that really *always*
>> necessary?
>
>It's always safe. :-) As you mentioned, if the missing node was the
>last one alive in the previous run, the cluster can't know whether it
>shut down cleanly or not. Even if the node was known to shut down
>cleanly in the last run, the cluster still can't know whether the node
>was started since then and is now merely unreachable. So, fencing is
>necessary to ensure it's not accessing resources.

I get that, but I was questioning the "necessary to ensure it's not
accessing resources" part of this statement.  My point is that
sometimes this might be overkill, because sometimes we might be able to
discern through other methods that there are no resources we need to
worry about potentially conflicting with what we want to run.  That's
why I gave the stateless clones example.

>The same scenario is why a single node can't have quorum at start-up in
>a cluster with "two_node" set. Both nodes have to see each other at
>least once before they can assume it's safe to do anything.

Yep.

>> Let's suppose further that the cluster configuration is such that no
>> stateful resources which could potentially conflict with other nodes
>> will ever get launched on that 5th node.  For example it might only
>> host stateless clones, or resources with requires=nothing set, or it
>> might not even host any resources at all due to some temporary
>> constraints which have been applied.
>>
>> In those cases, what is to be gained from fencing?  The only thing I
>> can think of is that using (say) IPMI to power-cycle the node *might*
>> fix whatever issue was preventing it from joining the cluster.  Are
>> there any other reasons for fencing in this case?  It wouldn't help
>> avoid any data corruption, at least.
>
>Just because constraints are telling the node it can't run a resource
>doesn't mean the node isn't malfunctioning and running it anyway. If
>the node can't tell us it's OK, we have to assume it's not.

Sure, but even if it *is* running it, if it's not conflicting with
anything or doing any harm, is it really always better to fence
regardless?

Disclaimer: to a certain extent I'm playing devil's advocate here to
stimulate a closer (re-)examination of the axiom we've grown so used
to over the years that if we don't know what a node is doing, we
should fence it.  I'm not necessarily arguing that fencing is wrong
here, but I think it's healthy to occasionally go back to first
principles and re-question why we are doing things a certain way, to
make sure that the original assumptions still hold true.  I'm familiar
with the pain that our customers experience when nodes are fenced for
less than very compelling reasons, so I think it's worth looking for
opportunities to reduce fencing to when it's really needed.

>> Now let's imagine the same scenario, except rather than a clean full
>> cluster shutdown, all nodes were affected by a power cut, but also
>> this time the whole cluster is configured to *only* run stateless
>> clones, so there is no risk of conflict between two nodes accidentally
>> running the same resource.  On startup, the 4 nodes in the quorum have
>> no way of knowing that the 5th node was also affected by the power
>> cut, so in theory from their perspective it could still be running a
>> stateless clone.  Again, is there anything to be gained from fencing
>> the 5th node once it exceeds the dc-deadtime threshold for joining,
>> other than the chance that a reboot might fix whatever was preventing
>> it from joining, and get the cluster back to full strength?
>
>If a cluster runs only services that have no potential to conflict,
>then you don't need a cluster. :-)

True :-)  Again as devil's advocate this scenario could be extended to
include remote nodes which *do* run resources which could conflict
(such as compute nodes), and in that case running stateless clones
(only) on the core cluster could be justified simply on the grounds
that we need Pacemaker for the remotes anyway, so we might as well use
it for the stateless clones rather than introducing keepalived as yet
another component ... but this is starting to get hypothetical, so
it's perhaps not worth spending energy discussing that tangent ;-)

>Unique clones require communication even if they're stateless (think
>IPaddr2).

Well yeah, IPaddr2 is arguably stateful since there are changing ARP
tables involved :-)

>I'm pretty sure even some anonymous stateless clones require
>communication to avoid issues.

Fair enough.

>> Also, when exactly does the dc-deadtime timer start ticking?
>> Is it reset to zero after a node is fenced, so that potentially that
>> node could go into a reboot loop if dc-deadtime is set too low?
>
>A node's crmd starts the timer at start-up and whenever a new election
>starts, and stops it when the DC makes it a join offer.

That's surprising - I would have expected it to be the other way
around, i.e. that the timer doesn't run on the node which is joining,
but one of the nodes already in the cluster (e.g. the DC).  Otherwise
how can fencing of that node be triggered if the node takes too long
to join?

>I don't think it ever reboots though, I think it just starts a new
>election.

Maybe we're talking at cross-purposes?  By "reboot loop", I was asking
if the node which fails to join could end up getting endlessly fenced:
join timeout -> fenced -> reboots -> join timeout -> fenced -> ... etc.

>So, you can get into an election loop, but I think network conditions
>would have to be pretty severe.

Yeah, that sounds like a different type of loop to the one I was
imagining.

>> The same questions apply if this troublesome node was actually a
>> remote node running pacemaker_remoted, rather than the 5th node in the
>> cluster.
>
>Remote nodes don't join at the crmd level as cluster nodes do, so they
>don't "start up" in the same sense

Sure, they establish a TCP connection via pacemaker_remoted when the
remote resource is starting.

>and start-up fencing doesn't apply to them.  Instead, the cluster
>initiates the connection when called for (I don't remember for sure
>whether it fences the remote node if the connection fails, but that
>would make sense).

Hrm, that's not what Yan said, and it's not what my L3 colleagues are
reporting either ;-)  I've been told (but not yet verified myself)
that if a remote resource's start operation times out (e.g. due to
the remote node not being up yet), the remote will get fenced.
But I see Yan has already replied with additional details on this.

>> I have an uncomfortable feeling that I'm missing something obvious,
>> probably due to the documentation's warning that "Not using the
>> default [for startup-fencing] is very unsafe!"  Or is it only
>> unsafe when the node which exceeded dc-deadtime on startup
>> could potentially be running a stateful resource which the cluster
>> now wants to restart elsewhere?  If that's the case, would it be
>> possible to optionally limit startup fencing to when it's really
>> needed?
>>
>> Thanks for any light you can shed!
>
>There's no automatic mechanism to know that, but if you know before a
>particular start that certain nodes are really down and are staying
>that way, you can disable start-up fencing in the configuration on
>disk, before starting the other nodes, then re-enable it once
>everything is back to normal.

Ahah!  That's the kind of tip I was looking for, thanks :-)  So you
mean by editing the CIB XML directly?  Would disabling startup-fencing
manually this way require a concurrent update of the epoch?
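
For concreteness, the kind of edit I have in mind would look roughly
like the Python sketch below.  It works on an in-memory XML fragment
rather than the real on-disk CIB, and the sample fragment, the nvpair
id and the set_cluster_option() helper are my own assumptions.  It
deliberately leaves the epoch untouched, since whether that needs
bumping is exactly what I'm unsure about, and it ignores anything
else the cluster might check when it loads the file from disk.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed CIB fragment for illustration only.
SAMPLE_CIB = """\
<cib admin_epoch="0" epoch="42" num_updates="0">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-stonith-enabled"
                name="stonith-enabled" value="true"/>
      </cluster_property_set>
    </crm_config>
  </configuration>
</cib>
"""


def set_cluster_option(cib: ET.Element, name: str, value: str) -> None:
    """Set a cluster option, adding the nvpair if it is not there yet."""
    props = cib.find("./configuration/crm_config/cluster_property_set")
    for nvpair in props.findall("nvpair"):
        if nvpair.get("name") == name:
            nvpair.set("value", value)
            return
    # Assumed id convention: "<property set id>-<option name>".
    ET.SubElement(props, "nvpair",
                  id=f"{props.get('id')}-{name}", name=name, value=value)


cib = ET.fromstring(SAMPLE_CIB)
set_cluster_option(cib, "startup-fencing", "false")   # before starting nodes
edited_xml = ET.tostring(cib, encoding="unicode")
```

Re-enabling it afterwards would be the same call with "true".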