[ClusterLabs] Antw: Re: reproducible split brain
cwh at eml.cc
Thu Mar 17 19:30:23 EDT 2016
On Thu, Mar 17, 2016, at 06:24 PM, Ken Gaillot wrote:
> On 03/17/2016 05:10 PM, Christopher Harvey wrote:
> > If I ignore pacemaker's existence, and just run corosync, corosync
> > disagrees about node membership in the situation presented in the first
> > email. While it's true that stonith just happens to quickly correct the
> > situation after it occurs, it still smells like a bug in the case where
> > corosync is used in isolation. Corosync is, after all, a membership and
> > total ordering protocol, and the nodes in the cluster are unable to
> > agree on membership.
> > The Totem protocol specifies a ring_id in the token passed in a ring.
> > Since all but one of the 3 nodes have formed a new ring with a new id,
> > how is it that the single node can survive in a ring with no other
> > members passing a token with the old ring_id?
> > Are there network failure situations that can fool the Totem membership
> > protocol or is this an implementation problem? I don't see how it could
> > not be one or the other, and it's bad either way.
> Neither, really. In a split brain situation, there simply is not enough
> information for any protocol or implementation to reliably decide what
> to do. That's what fencing is meant to solve -- it provides the
> information that certain nodes are definitely not active.
> There's no way for either side of the split to know whether the opposite
> side is down, or merely unable to communicate properly. If the latter,
> it's possible that they are still accessing shared resources, which,
> without proper communication, can lead to serious problems (e.g. data
> corruption of a shared volume).
The Totem protocol is silent on the topic of fencing and resources, much
the way TCP is.
Please explain to me what needs to be fenced in a cluster without
resources where membership and total message ordering are the only
concern. If fencing were a requirement for membership and ordering,
wouldn't stonith be part of corosync and not pacemaker?
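
For reference, the scenario being argued over can be sketched in a few
lines of Python. This is a toy model under stated assumptions, not Totem
and not corosync code: the node names, the `view` helper and the
link/alive sets are all hypothetical. It only shows what each node can
observe: node a sees the same membership whether c has crashed or is
merely cut off, while a cut-off c still sees itself as a (singleton)
member.

```python
# Toy model only: not Totem, not corosync. All names are hypothetical.

def view(node, nodes, alive, links):
    """Membership as observed by `node`: itself plus every live peer it can reach."""
    return frozenset(
        {node}
        | {
            p
            for p in nodes
            if p != node and p in alive and frozenset((node, p)) in links
        }
    )

NODES = {"a", "b", "c"}
FULL_MESH = {frozenset((x, y)) for x in NODES for y in NODES if x != y}

# Case 1: node c has crashed; the network itself is intact.
crash = view("a", NODES, alive={"a", "b"}, links=FULL_MESH)

# Case 2: node c is still running but cut off from a and b.
split_links = {frozenset(("a", "b"))}
split = view("a", NODES, alive=NODES, links=split_links)
c_side = view("c", NODES, alive=NODES, links=split_links)

# Node a's observation is identical in both cases.
print(crash == split)
print(sorted(c_side))
```

Whether that ambiguity matters when membership and ordering are the only
concern is exactly the question raised above; the sketch takes no
position on it.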