[ClusterLabs] Antw: Re: reproducible split brain

Fri Mar 18 03:33:58 EDT 2016

Christopher,

> If I ignore pacemaker's existence, and just run corosync, corosync
> disagrees about node membership in the situation presented in the first
> email. While it's true that stonith just happens to quickly correct the
> situation after it occurs it still smells like a bug in the case where
> corosync is used in isolation. Corosync is after all a membership and
> total ordering protocol, and the nodes in the cluster are unable to
> agree on membership.
>
> The Totem protocol specifies a ring_id in the token passed in a ring.
> Since all of the 3 nodes but one have formed a new ring with a new id
> how is it that the single node can survive in a ring with no other
> members passing a token with the old ring_id?
>
> Are there network failure situations that can fool the Totem membership
> protocol or is this an implementation problem? I don't see how it could

The main problem (as you noted in the original mail) is really about
blocking only one direction (the input one). This is called a Byzantine
failure, and it is something that corosync is unable to handle. Totem was
simply never designed to solve Byzantine failures.
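
To make the one-direction case concrete, here is a rough sketch in Python.
It is only a toy reachability model, not the Totem algorithm and not
corosync code; the node names n1..n3 and all function names are made up
for illustration. It assumes a node accepts a peer as a member only when
packets flow in both directions, and that one node drops all of its input
while still being able to send.

```python
# Toy model only: this is NOT Totem, just bookkeeping of who can reach whom.
# Node names and function names are hypothetical.
NODES = ["n1", "n2", "n3"]


def can_deliver(src, dst, drops_input):
    """A packet from src reaches dst unless dst drops all of its input."""
    return dst not in drops_input


def hears(node, drops_input):
    """Peers whose packets still arrive at this node."""
    return sorted(
        peer for peer in NODES
        if peer != node and can_deliver(peer, node, drops_input)
    )


def local_view(node, drops_input):
    """The node itself plus every peer it has a working two-way path with."""
    view = {node}
    for peer in NODES:
        if peer == node:
            continue
        inbound = can_deliver(peer, node, drops_input)
        outbound = can_deliver(node, peer, drops_input)
        if inbound and outbound:
            view.add(peer)
    return sorted(view)


def all_views(drops_input):
    """Membership as each node would see it under this toy assumption."""
    return {node: local_view(node, drops_input) for node in NODES}


healthy = all_views(drops_input=set())
one_way = all_views(drops_input={"n3"})  # n3 blocks input only

if __name__ == "__main__":
    print("healthy:", healthy)
    print("one-way:", one_way)
    print("n1 still hears:", hears("n1", {"n3"}))
    print("n3 hears:", hears("n3", {"n3"}))
```

In this model n1 and n2 settle on each other while n3 is left alone with
itself, yet n1 and n2 keep receiving whatever n3 sends. That last part is
what separates this case from a clean partition.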

Regards,
   Honza

> not be one or the other, and it's bad either way.
>
> On Thu, Mar 17, 2016, at 02:08 PM, Digimer wrote:
>> On 17/03/16 01:57 PM, vija ar wrote:
>>> root file system is fine ...
>>>
>>> but fencing is not a necessity, a cluster should function without it .. i
>>> see the issue with corosync which has all been .. an inherent way of not
>>> working neatly or smoothly ..
>>
>> Absolutely wrong.
>>
>> If you have a service that can run on both/all nodes at the same time
>> without coordination, you don't need a cluster, just run your services
>> everywhere.
>>
>> If that's not the case, then you need fencing so that the (surviving)
>> node(s) can be sure that they know where services are and are not
>> running.
>>
>>> e.g. take an issue where the live node is hung in a db cluster .. now from
>>> the db perspective transactions are not happening and that is fine as the node
>>> is having some issue .. now there is no need to fence this hung node but
>>> just to switch over to the passive one .. but that doesn't happen and fencing
>>> takes place either by reboot or shutdown .. which further makes the DB dirty
>>> or worse, leaves it in a non-recoverable state, which wouldn't have happened
>>> if a normal switch to the other node as in a cluster had happened ...
>>>
>>> i see fencing is not a solution, it's only required to forcefully take
>>> control, which is not always the case
>>>
>>> On Thu, Mar 17, 2016 at 12:49 PM, Ulrich Windl
>>> <Ulrich.Windl at rz.uni-regensburg.de> wrote:
>>>
>>>      >>> Christopher Harvey <cwh at eml.cc> wrote on 16.03.2016 at 21:04
>>>      in message
>>>      <1458158684.122207.551267810.11F73AB9 at webmail.messagingengine.com>:
>>>      [...]
>>>      >> > Would stonith solve this problem, or does this look like a bug?
>>>      >>
>>>      >> It should, that is its job.
>>>      >
>>>      > is there some log I can enable that would say
>>>      > "ERROR: hey, I would use stonith here, but you have it disabled! your
>>>      > warranty is void past this point! do not pass go, do not file a bug"?
>>>
>>>      What should the kernel say during boot if the user has not defined a
>>>      root file system?
>>>
>>>      Maybe the "stonith-enabled=false" setting should be called either
>>>      "data-corruption-mode=true" or "hang-forever-on-error=true" ;-)
>>>
>>>      Regards,
>>>      Ulrich
>>>
>>>
>>>
>>>      _______________________________________________
>>>      Users mailing list: Users at clusterlabs.org
>>>      http://clusterlabs.org/mailman/listinfo/users
>>>
>>>      Project Home: http://www.clusterlabs.org
>>>      Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>      Bugs: http://bugs.clusterlabs.org
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Digimer
>> Papers and Projects: https://alteeve.ca/w/
>> What if the cure for cancer is trapped in the mind of a person without
>> access to education?
>>
>