[ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)

Klaus Wenninger kwenning at redhat.com
Thu Oct 6 10:26:25 EDT 2016


On 10/06/2016 04:16 PM, Digimer wrote:
> On 06/10/16 05:38 AM, Martin Schlegel wrote:
>> Thanks for the confirmation Jan, but this sounds a bit scary to me !
>>
>> Spinning this experiment a bit further ...
>>
>> Would this not also mean that with a passive rrp with 2 rings it only takes 2
>> different nodes that are unable to communicate on different networks at the
>> same time to have all rings marked faulty on _every_ node ... and therefore all
>> cluster members lose quorum immediately, even though n-2 cluster members are
>> technically able to send and receive heartbeat messages through both rings ?
>>
>> I really hope the answer is no and the cluster still somehow has a quorum in
>> this case.
>>
>> Regards,
>> Martin Schlegel
>>
>>
>>> Jan Friesse <jfriesse at redhat.com> hat am 5. Oktober 2016 um 09:01 geschrieben:
>>>
>>> Martin,
>>>
>>>> Hello all,
>>>>
>>>> I am trying to understand the results of the following 2 Corosync heartbeat
>>>> ring failure scenarios
>>>> I have been testing, and I hope somebody can explain why they make sense.
>>>>
>>>> Consider the following cluster:
>>>>
>>>>  * 3x Nodes: A, B and C
>>>>  * 2x NICs for each Node
>>>>  * Corosync 2.3.5 configured with "rrp_mode: passive" and
>>>>  udpu transport with ring id 0 and 1 on each node.
>>>>  * On each node "corosync-cfgtool -s" shows:
>>>>  [...] ring 0 active with no faults
>>>>  [...] ring 1 active with no faults
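
For illustration, a corosync.conf matching this description would look roughly
like the sketch below (addresses and node ids are placeholders, not taken from
the report):

    totem {
        version: 2
        rrp_mode: passive
        transport: udpu
        interface {
            ringnumber: 0
            bindnetaddr: 10.0.0.0
        }
        interface {
            ringnumber: 1
            bindnetaddr: 10.0.1.0
        }
    }
    nodelist {
        node {
            nodeid: 1
            ring0_addr: 10.0.0.1
            ring1_addr: 10.0.1.1
        }
        # node entries for B and C follow the same pattern
    }
    quorum {
        provider: corosync_votequorum
    }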
>>>>
>>>> Consider the following scenarios:
>>>>
>>>>  1. On node A only block all communication on the first NIC configured with
>>>> ring id 0
>>>>  2. On node A only block all communication on all NICs configured with
>>>> ring id 0 and 1
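
For illustration, one way to reproduce the two scenarios on node A is to drop
all traffic on the respective interface with iptables (the interface names
eth0/eth1 for ring 0/1 are assumptions):

    # scenario 1: block only the ring 0 interface
    iptables -A INPUT  -i eth0 -j DROP
    iptables -A OUTPUT -o eth0 -j DROP

    # scenario 2: additionally block the ring 1 interface
    iptables -A INPUT  -i eth1 -j DROP
    iptables -A OUTPUT -o eth1 -j DROP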
>>>>
>>>> The result of the above scenarios is as follows:
>>>>
>>>>  1. Nodes A, B and C (!) display the following ring status:
>>>>  [...] Marking ringid 0 interface <IP-Address> FAULTY
>>>>  [...] ring 1 active with no faults
>>>>  2. Node A is shown as OFFLINE - B and C display the following ring status:
>>>>  [...] ring 0 active with no faults
>>>>  [...] ring 1 active with no faults
>>>>
>>>> Questions:
>>>>  1. Is this the expected outcome ?
>>> Yes
>>>
>>>> 2. In experiment 1, B and C can still communicate with each other over both
>>>> NICs, so why are
>>>>  B and C not displaying a "no faults" status for ring id 0 and 1, just like
>>>> in experiment 2,
>>> Because this is how RRP works. RRP marks the whole ring as failed, so every
>>> node sees that ring as failed.
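
A practical consequence of this behaviour: after the failed link on node A has
been repaired, the ring can be checked and re-enabled cluster-wide with
corosync-cfgtool, roughly:

    corosync-cfgtool -s   # show ring status as seen by the local node
    corosync-cfgtool -r   # reset redundant ring state cluster-wide once the fault is fixed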
>>>
>>>> when node A is completely unreachable ?
>>> Because it's a different scenario. In scenario 1 there is a 3-node
>>> membership where one of the nodes has one failed ring -> the whole ring is
>>> marked as failed. In scenario 2 there is a 2-node membership where both
>>> rings work as expected. Node A is completely unreachable and is not in the
>>> membership.
>>>
>>> Regards,
>>>  Honza
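
To spell out the quorum arithmetic behind that explanation (a worked example
assuming the default of 1 vote per node):

    expected votes : 3
    quorum         : 3 / 2 + 1 = 2   (integer division)

    scenario 1: A, B and C all stay in the membership (3 votes >= 2)
                -> the cluster stays quorate, only ring 0 is marked faulty everywhere
    scenario 2: B and C form the membership (2 votes >= 2) -> still quorate
                A alone (1 vote < 2)                       -> not quorate

The live values can be checked on each node with corosync-quorumtool -s.
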
> Have you considered using active/passive bonded interfaces? If you did,
> you would be able to fail links in any order on the nodes and corosync
> would not know the difference.
>
Still an interesting point I hadn't been aware of so far - although
I knew the individual bits, I probably hadn't thought them through enough
till now...

Usually one - at least me so far - would rather think that having
the awareness of redundancy/clustering as high up as possible in the
protocol/application stack would open up possibilities for more
appropriate reactions.
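
For reference, the active/passive bonding Digimer suggests could look roughly
like the sketch below on a RHEL-style system (device names and addresses are
assumptions); corosync then runs a single, non-RRP ring on top of bond0 and
link failover is handled below corosync:

    # /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    TYPE=Bond
    BONDING_OPTS="mode=active-backup miimon=100"
    BOOTPROTO=none
    IPADDR=10.0.0.1
    PREFIX=24
    ONBOOT=yes

    # /etc/sysconfig/network-scripts/ifcfg-eth0  (analogous for eth1)
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    BOOTPROTO=none
    ONBOOT=yes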
