[ClusterLabs] Recovering after split-brain

Ken Gaillot kgaillot at redhat.com
Tue Jun 21 13:54:13 EDT 2016


On 06/21/2016 12:27 PM, Dimitri Maziuk wrote:
> On 06/21/2016 12:13 PM, Andrei Borzenkov wrote:
> 
>> You should not run pacemaker without some sort of fencing. This
>> need not be network-controlled power socket (and tiebreaker is
>> not directly related to fencing).

Fencing is a best practice for high availability, regardless of what
software is used. There is simply no other way to guarantee an
unresponsive node is not corrupting shared resources.

However, Pacemaker does not require it.

> Yes it can be sysadmin-controlled power socket. It has to be a
> power socket, if you don't trust me, read Dejan's list of fencing
> devices.

Perfection is not a possibility in this world, not even with a power
socket. It's just a matter of getting the best possible/practical
coverage for failure scenarios.

Unresponsive nodes that nevertheless can write to a shared disk, or
advertise an IP address on a network, etc., are a relatively common
failure scenario.

You are correct that most fence devices do some variation of cutting
power -- IPMI, blade centers, intelligent PDUs, etc. -- and it is the
most reliable method. But some people prefer to cut network/disk
access instead, and Pacemaker also supports watchdog devices for
(relatively) reliable self-fencing.

> Tiebreaking is directly related to figuring out which of the two
> nodes is to be fenced, because neither of them can tell on its
> own.

With two-node Pacemaker clusters, this is usually handled by
configuring separate fence devices for each node, and configuring a
delay on one of them. One node shoots faster.
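
As a rough illustration of why the delay settles the tie, here is a toy
model in Python. It is not Pacemaker code: the FenceDevice class, the
node names, and the delay values are all made up for the example. Each
device targets one node, and the device with the shorter delay fires
first, so its target is dead before it can shoot back.

```python
from dataclasses import dataclass


@dataclass
class FenceDevice:
    """Hypothetical stand-in for a fence device that can kill one node."""
    target: str    # node this device powers off
    delay: float   # seconds the device waits before acting


def fencing_race(dev_a, dev_b):
    """Return (survivor, victim) when both nodes try to fence each other.

    The device with the shorter delay fires first and kills its target.
    The surviving node is the one targeted by the slower device.
    """
    if dev_a.delay == dev_b.delay:
        # Without a delay difference there is no guaranteed winner.
        raise ValueError("equal delays: both nodes may shoot each other")
    first, other = (dev_a, dev_b) if dev_a.delay < dev_b.delay else (dev_b, dev_a)
    return other.target, first.target


# Fencing node1 is delayed, so node1 wins the race and node2 is shot.
slow = FenceDevice(target="node1", delay=15.0)
fast = FenceDevice(target="node2", delay=0.0)
survivor, victim = fencing_race(slow, fast)
print(survivor, "survives;", victim, "is fenced")
```

The node whose fence device carries the delay is the one that survives,
so the delay goes on the device for the node you would rather keep.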

>> I fail to see how heartbeat makes any difference here, sorry.
> 
> Third node and remote-controlled PDU were not a requirement for 
> haresources mode. If I wanted to run it so that when it breaks I
> get to keep the pieces, I could.

That's an option with Pacemaker as well. Two-node clusters are by far
the most popular Pacemaker configuration. While most HA professionals
recommend fencing, and most companies that sell enterprise support
require it in order to support a cluster, Pacemaker itself does not
require it.

The most obvious Pacemaker parameter here is stonith-enabled, which
defaults to true but can be set to false. There are also numerous
parameters such as on-fail, multiple-active, requires,
migration-threshold, and failure-timeout, which provide fine control
over failure response.
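
To give a feel for how two of these interact, here is a simplified
Python sketch of migration-threshold and failure-timeout on a single
node. It is only a toy model under my own assumptions, not how the
policy engine is implemented: the FailureTracker class and its method
names are invented, and the real cluster weighs many more inputs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureTracker:
    """Hypothetical per-node failure bookkeeping for one resource."""
    migration_threshold: int   # failures tolerated before moving away
    failure_timeout: float     # seconds after the last failure to forget
    fail_count: int = 0
    last_failure: Optional[float] = None

    def _expire(self, now):
        # Forget all failures once the most recent one is old enough.
        if (self.last_failure is not None
                and now - self.last_failure >= self.failure_timeout):
            self.fail_count = 0
            self.last_failure = None

    def record_failure(self, now):
        self._expire(now)
        self.fail_count += 1
        self.last_failure = now

    def node_allowed(self, now):
        """True while the resource may still run on this node."""
        self._expire(now)
        return self.fail_count < self.migration_threshold


tracker = FailureTracker(migration_threshold=3, failure_timeout=60.0)
for t in (0.0, 10.0, 20.0):
    tracker.record_failure(t)
print(tracker.node_allowed(21.0))   # threshold reached, node is avoided
print(tracker.node_allowed(80.0))   # failures expired, node is usable again
```

In this model the resource is pushed off the node after the third
failure and allowed back once 60 seconds pass without a new one.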

Fencing was just as much a recommended practice with pre-Pacemaker
Heartbeat as it is now.
