[ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)

Jan Friesse jfriesse at redhat.com
Fri Oct 7 09:28:13 UTC 2016


Martin Schlegel wrote:
> Thanks for the confirmation Jan, but this sounds a bit scary to me!
>
> Spinning this experiment a bit further ...
>
> Would this not also mean that with a passive rrp with 2 rings, it only takes 2
> different nodes that are unable to communicate on different networks at the
> same time to have all rings marked faulty on _every_ node ... and therefore all
> cluster members lose quorum immediately, even though n-2 cluster members are
> technically able to send and receive heartbeat messages through both rings?

Not exactly, but this situation causes corosync to start behaving really 
badly, spending most of its time in a "creating new membership" loop.

Yes, RRP is simply bad. If you can, use bonding instead. Replacing RRP with 
knet is the biggest TODO for 3.x.
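
On RHEL-style systems an active-backup bond can be sketched with ifcfg files 
along these lines (device names and addresses are just placeholders):

    # /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    TYPE=Bond
    BONDING_MASTER=yes
    BONDING_OPTS="mode=active-backup miimon=100"
    IPADDR=192.168.1.10
    PREFIX=24
    ONBOOT=yes
    BOOTPROTO=none

    # /etc/sysconfig/network-scripts/ifcfg-eth0 (and the same for ifcfg-eth1)
    DEVICE=eth0
    TYPE=Ethernet
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none

Corosync then runs a single ring on the bond0 address and the kernel handles 
failover between the NICs.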

Regards,
   Honza

>
> I really hope the answer is no and the cluster still somehow has a quorum in
> this case.
>
> Regards,
> Martin Schlegel
>
>
>> Jan Friesse <jfriesse at redhat.com> wrote on 5 October 2016 at 09:01:
>>
>> Martin,
>>
>>> Hello all,
>>>
>>> I am trying to understand the results of the following 2 Corosync heartbeat
>>> ring failure scenarios I have been testing, and I hope somebody can explain
>>> why this makes any sense.
>>>
>>> Consider the following cluster:
>>>
>>>   * 3x Nodes: A, B and C
>>>   * 2x NICs for each Node
>>>   * Corosync 2.3.5 configured with "rrp_mode: passive" and
>>>   udpu transport with ring ids 0 and 1 on each node (see the
>>>   corosync.conf sketch below)
>>>   * On each node "corosync-cfgtool -s" shows:
>>>   [...] ring 0 active with no faults
>>>   [...] ring 1 active with no faults
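>>>
>>>   A corosync.conf for such a setup would look roughly like this
>>>   (addresses are placeholders):
>>>
>>>     totem {
>>>         version: 2
>>>         transport: udpu
>>>         rrp_mode: passive
>>>         interface {
>>>             ringnumber: 0
>>>             bindnetaddr: 10.0.0.0
>>>         }
>>>         interface {
>>>             ringnumber: 1
>>>             bindnetaddr: 10.0.1.0
>>>         }
>>>     }
>>>
>>>     nodelist {
>>>         node {
>>>             nodeid: 1
>>>             ring0_addr: 10.0.0.1
>>>             ring1_addr: 10.0.1.1
>>>         }
>>>         # nodes B and C are defined the same way
>>>     }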
>>>
>>> Consider the following scenarios:
>>>
>>>   1. On node A only, block all communication on the first NIC, configured
>>> with ring id 0 (e.g. with iptables rules as sketched below)
>>>   2. On node A only, block all communication on both NICs, configured with
>>> ring ids 0 and 1
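>>>
>>>   The blocking can be done with iptables rules along these lines (interface
>>>   names are just examples):
>>>
>>>     # node A, scenario 1: drop all traffic on the ring-0 NIC
>>>     iptables -A INPUT  -i eth0 -j DROP
>>>     iptables -A OUTPUT -o eth0 -j DROP
>>>     # scenario 2: do the same for the ring-1 NIC (eth1) as well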
>>>
>>> The results of the above scenarios are as follows:
>>>
>>>   1. Nodes A, B and C (!) display the following ring status:
>>>   [...] Marking ringid 0 interface <IP-Address> FAULTY
>>>   [...] ring 1 active with no faults
>>>   2. Node A is shown as OFFLINE - B and C display the following ring status:
>>>   [...] ring 0 active with no faults
>>>   [...] ring 1 active with no faults
>>>
>>> Questions:
>>>   1. Is this the expected outcome?
>>
>> Yes
>>
>>> 2. In experiment 1, B and C can still communicate with each other over both
>>> NICs, so why are B and C not displaying a "no faults" status for ring ids 0
>>> and 1, just like in experiment 2,
>>
>> Because this is how RRP works. RRP marks the whole ring as failed, so every
>> node sees that ring as failed.
>>
>>> when node A is completely unreachable?
>>
>> Because it's a different scenario. In scenario 1 there is a 3-node
>> membership in which one node has a failed ring -> the whole ring is marked
>> failed. In scenario 2 there is a 2-node membership in which both rings work
>> as expected. Node A is completely unreachable, so it is not part of the
>> membership.
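>>
>> If a ring stays marked FAULTY after the blocked link is restored, redundant
>> ring operation can be re-enabled cluster wide with:
>>
>>    corosync-cfgtool -r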
>>
>> Regards,
>>   Honza
>>
>>> Regards,
>>> Martin Schlegel
>>>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>




