[ClusterLabs] questions about startup fencing

Wed Dec 6 12:42:48 EST 2017

Ken Gaillot <kgaillot at redhat.com> wrote: 
>On Thu, 2017-11-30 at 11:58 +0000, Adam Spiers wrote: 
>>Ken Gaillot <kgaillot at redhat.com> wrote: 
>>>On Wed, 2017-11-29 at 14:22 +0000, Adam Spiers wrote: 

[snipped]

>>>>Let's suppose further that the cluster configuration is such 
>>>>that no stateful resources which could potentially conflict 
>>>>with other nodes will ever get launched on that 5th node.  For 
>>>>example it might only host stateless clones, or resources with 
>>>>requires=nothing set, or it might not even host any resources at 
>>>>all due to some temporary constraints which have been applied. 
>>>>
>>>>In those cases, what is to be gained from fencing?  The only 
>>>>thing I can think of is that using (say) IPMI to power-cycle 
>>>>the node *might* fix whatever issue was preventing it from 
>>>>joining the cluster.  Are there any other reasons for fencing 
>>>>in this case?  It wouldn't help avoid any data corruption, at 
>>>>least. 
>>>
>>>Just because constraints are telling the node it can't run a 
>>>resource doesn't mean the node isn't malfunctioning and running 
>>>it anyway.  If the node can't tell us it's OK, we have to assume 
>>>it's not. 
>>
>>Sure, but even if it *is* running it, if it's not conflicting with 
>>anything or doing any harm, is it really always better to fence 
>>regardless? 
>
>There's a resource meta-attribute "requires" that says what a resource 
>needs to start. If a resource can't do any harm when it runs awry, you 
>can set requires="quorum" (or even "nothing"). 
>
>So, that's sort of a way to let the cluster know that, but it doesn't 
>currently do what you're suggesting, since start-up fencing is purely 
>about the node and not about the resources. I suppose if the cluster 
>had no resources requiring fencing (or, to push it further, no such 
>resources that will be probed on that node), we could disable start-up 
>fencing, but that's not done currently. 

Yeah, that's the kind of thing I was envisaging. 

>>Disclaimer: to a certain extent I'm playing devil's advocate here to 
>>stimulate a closer (re-)examination of the axiom we've grown so used 
>>to over the years that if we don't know what a node is doing, we 
>>should fence it.  I'm not necessarily arguing that fencing is wrong 
>>here, but I think it's healthy to occasionally go back to first 
>>principles and re-question why we are doing things a certain way, to 
>>make sure that the original assumptions still hold true.  I'm 
>>familiar with the pain that our customers experience when nodes are 
>>fenced for less than very compelling reasons, so I think it's worth 
>>looking for opportunities to reduce fencing to when it's really 
>>needed.
>
>The fundamental purpose of a high-availability cluster is to keep the 
>desired service functioning, above all other priorities (including, 
>unfortunately, making sysadmins' lives easier). 
>
>If a service requires an HA cluster, it's a safe bet it will have 
>problems in a split-brain situation (otherwise, why bother with the 
>overhead). Even something as simple as an IP address will render a 
>service useless if it's brought up on two machines on a network. 
>
>Fencing is really the only hammer we have in that situation. At that 
>point, we have zero information about what the node is doing. If it's 
>powered off (or cut off from disk/network), we know it's not doing 
>anything.
>
>Fencing may not always help the situation, but it's all we've got. 

Sure, but I'm not (necessarily) even talking about a split-brain 
situation.  For example, what if a cluster with remote nodes is shut 
down cleanly, and then all the core nodes boot up cleanly but none of 
the remote nodes are powered on till hours or even days later? 

If I understand Yan correctly, in this situation all the remotes will 
be marked as needing fencing, and this is the bit that doesn't make 
sense to me.  If Pacemaker can't reach *any* remotes, it can't start 
any resources on those remotes, so (in the case where resources are 
partitioned cleanly into those which run on remotes vs. those which 
don't) there is no danger of any concurrency violation.  So fencing 
remotes before you can use them is totally pointless.  Surely fencing 
of node A should only happen when Pacemaker is ready to start resource 
X on node B which might already be running on node A.  But if no such 
node B exists then fencing is overkill.  It would be better to wait 
until the first remote joins the cluster, at which point Pacemaker can 
assess its current state and decide the best course of action. 
Otherwise it's like cutting off your nose to spite your face. 
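
To make the rule I'm proposing concrete, here is a rough sketch in 
Python.  It is purely a toy model of the policy described above, not 
of anything Pacemaker does today, and every name in it (Node, 
needs_fencing_before_start, may_run) is made up for illustration. 

```python
# Toy model only: this is NOT how Pacemaker decides fencing today.
# All names here (Node, needs_fencing_before_start) are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    joined: bool                                # has it joined the cluster?
    may_run: set = field(default_factory=set)   # resources it could be running


def needs_fencing_before_start(resource, target, nodes):
    """Return the names of nodes that would need fencing before starting
    `resource` on `target`: nodes which have not joined (so their state
    is unknown) and which could already be running that resource."""
    if not target.joined:
        # Nothing can be started on a node we cannot reach, so there is
        # no concurrency violation to guard against yet.
        return []
    return [n.name for n in nodes
            if n is not target and not n.joined and resource in n.may_run]
```

With two remotes that have not yet joined, starting a core-only 
resource on a core node yields an empty list; once the first remote 
joins and a remote-only resource is about to start there, only the 
remaining unseen remote is returned. 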

In fact, in the particular scenario which prompted me to start this 
whole discussion, I suspect the above also applies even if some 
remotes joined the newly booted cluster quickly whilst others still 
take hours or days to boot - because in that scenario it is 
additionally safe to assume that none of the resources managed on 
those remotes by pacemaker_remoted would ever be started by anything 
other than pacemaker_remoted, since a) the whole system is configured 
automatically in a way which ensures the managed services won't 
automatically start at boot via systemd, and b) if someone started 
them manually, they would invalidate the warranty on that cluster ;-) 
Therefore we know that if a remote node has not yet joined the newly 
booted cluster, it can't be running anything which would conflict with 
the other remotes. 

>We give the user a good bit of control over fencing policies: corosync 
>tuning, stonith-enabled, startup-fencing, no-quorum-policy, requires, 
>on-fail, and the choice of fence agent. It can be a challenge for a new 
>user to know all the knobs to turn, but HA is kind of unavoidably 
>complex.

Indeed.  I just haven't figured out how to configure the cluster for 
the above scenario yet, so that it doesn't always fence latecomer 
remote nodes. 

[snipped]

>>>>Also, when exactly does the dc-deadtime timer start ticking? 
>>>>Is it reset to zero after a node is fenced, so that potentially 
>>>>that node could go into a reboot loop if dc-deadtime is set too 
>>>>low?
>>>
>>>A node's crmd starts the timer at start-up and whenever a new 
>>>election starts, and stops it when the DC makes it a join 
>>>offer.
>>
>>That's surprising - I would have expected it to be the other way 
>>around, i.e. that the timer doesn't run on the node which is joining, 
>>but one of the nodes already in the cluster (e.g. the DC).  Otherwise 
>>how can fencing of that node be triggered if the node takes too long 
>>to join?
>>
>>>I don't think it ever reboots though, I think it just starts a new 
>>>election. 
>>
>>Maybe we're talking at cross-purposes?  By "reboot loop", I was 
>>asking if the node which fails to join could end up getting 
>>endlessly fenced: join timeout -> fenced -> reboots -> join timeout 
>>-> fenced -> ...  etc. 
>
>startup-fencing and dc-deadtime don't have anything to do with each 
>other.
>
>There are two separate joins: the node joins at the corosync layer, and 
>then its crmd joins the other crmds at the pacemaker layer. One of 
>the crmds is then elected DC. 
>
>startup-fencing kicks in if the cluster has quorum and the DC sees no 
>node status in the CIB for a node. Node status will be recorded in the 
>CIB once it joins at the corosync layer. So, all nodes have until 
>quorum is reached, a DC is elected, and the DC invokes the policy 
>engine, to join at the cluster layer, else they will be shot. (And at 
>that time, their status is known and recorded as dead.) This only 
>happens when the cluster first starts, and is the only way to handle 
>split-brain at start-up. 
>
>dc-deadtime is for the DC election. When a node joins an existing 
>cluster, it expects the existing DC to make it a membership offer (at 
>the pacemaker layer). If that doesn't happen within dc-deadtime, the 
>node asks for a new DC election. The idea is that the DC may be having 
>trouble that hasn't been detected yet. Similarly, whenever a new 
>election is called, all of the nodes expect a join offer from whichever 
>node is elected DC, and again they call a new election if that doesn't 
>happen in dc-deadtime. 

Ahah OK thanks, that's super helpful!  I don't suppose it's documented 
anywhere?  I didn't find it in Pacemaker Explained, at least. 
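
To check my understanding, here is how I would summarise the two 
mechanisms as a toy Python model.  The function and parameter names 
are my own invention rather than anything in the Pacemaker code, and 
the logic is only my reading of the explanation above. 

```python
# Toy model of the two mechanisms described above; names are hypothetical.

def startup_fencing_targets(expected, status_in_cib, have_quorum,
                            startup_fencing=True):
    """Nodes the DC would shoot on its first policy-engine run: those
    with no node status recorded in the CIB, provided the cluster has
    quorum and startup-fencing is enabled."""
    if not (have_quorum and startup_fencing):
        return []
    return sorted(n for n in expected if n not in status_in_cib)


def should_call_election(seconds_waiting, dc_deadtime, got_join_offer):
    """A node asks for a new DC election if no join offer has arrived
    within dc-deadtime.  Nothing in this path fences anybody."""
    return (not got_join_offer) and seconds_waiting >= dc_deadtime
```

The point of keeping these as two separate functions is that neither 
takes the other's inputs: dc-deadtime never appears in the fencing 
decision, which is why a low dc-deadtime cannot by itself cause the 
reboot loop I was worried about. 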

[snipped]

>>>>I have an uncomfortable feeling that I'm missing something 
>>>>obvious, probably due to the documentation's warning that "Not 
>>>>using the default [for startup-fencing] is very unsafe!"  Or is 
>>>>it only unsafe when the node which exceeded dc-deadtime on 
>>>>startup could potentially be running a stateful resource which 
>>>>the cluster now wants to restart elsewhere?  If that's the 
>>>>case, would it be possible to optionally limit startup fencing 
>>>>to when it's really needed? 
>>>>
>>>>Thanks for any light you can shed! 
>>>
>>>There's no automatic mechanism to know that, but if you know 
>>>before a particular start that certain nodes are really down and 
>>>are staying that way, you can disable start-up fencing in the 
>>>configuration on disk, before starting the other nodes, then 
>>>re-enable it once everything is back to normal. 
>>
>>Ahah!  That's the kind of tip I was looking for, thanks :-)  So you 
>>mean by editing the CIB XML directly?  Would disabling startup- 
>>fencing manually this way require a concurrent update of the epoch? 
>
>You can edit the CIB on disk when the cluster is down, but you have to 
>go about it carefully. 
>
>Rather than edit it directly, you can use 
>CIB_file=/var/lib/pacemaker/cib/cib.xml when invoking cibadmin (or your 
>favorite higher-level tool). cibadmin will update the hash that 
>pacemaker uses to verify the CIB's integrity. Alternatively, you can 
>remove *everything* in /var/lib/pacemaker/cib except cib.xml, then edit 
>it directly. 
>
>Updating the admin epoch is a good idea if you want to be sure your 
>edited CIB wins, although starting that node first is also good enough. 

Again, great info which deserves to be documented if it isn't already ;-) 
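
For my own notes, here is a minimal Python sketch of the CIB_file 
approach.  It only builds the command and environment; it does not 
run anything.  Using crm_attribute as the "higher-level tool" is my 
assumption, as is the helper name offline_property_command; the path 
is the one quoted above.  I have not tested this against a real 
cluster. 

```python
# Sketch only: builds, but does not run, a command that changes a
# cluster property in the on-disk CIB while the cluster is stopped.
import os

CIB_PATH = "/var/lib/pacemaker/cib/cib.xml"  # path quoted in the thread


def offline_property_command(name, value, cib_path=CIB_PATH):
    """Return (env, argv) for setting a cluster property against a CIB
    file rather than a live cluster.  Choosing crm_attribute here is an
    assumption; the thread only names cibadmin explicitly."""
    # CIB_file tells the Pacemaker command-line tools to operate on
    # the given file instead of connecting to a running cluster.
    env = dict(os.environ, CIB_file=cib_path)
    argv = ["crm_attribute", "--type", "crm_config",
            "--name", name, "--update", value]
    return env, argv
```

On a stopped node, the result of 
offline_property_command("startup-fencing", "false") could then be 
handed to subprocess.run(argv, env=env, check=True), and the same call 
with "true" would re-enable it once everything is back to normal. 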

Thanks a lot for the really helpful replies!