[ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)

Jan Friesse jfriesse at redhat.com
Mon Oct 10 03:16:35 EDT 2016


> Thanks for all the responses from Jan, Ulrich and Digimer!
>
> We are already using bonded network interfaces, but we are also forced to go
> across IP subnets. Certain routes between routers can go missing, and have
> done so.
>
> This has happened on the public network of one of our nodes, which became
> inaccessible from other local, public IP subnets. If this were to happen in
> parallel on another node on our private network, the entire cluster would be
> down, just because (as Ulrich said, "It's a ring!") both heartbeat rings are
> marked faulty. That is not an optimal result, because cluster communication
> is in fact 100% possible between all nodes.
>
> With an increasing number of nodes this risk is fairly big. Just think about
> providers of bigger cloud infrastructures.
>
> With the above scenario in mind, is there a better (tested and recommended)
> way to configure this?

I don't think so.

> ... or is knet the way to go in the future then?

Yes, knet is the future.

Regards,
   Honza

>
>
> Regards,
> Martin Schlegel
>
>
>> Jan Friesse <jfriesse at redhat.com> wrote on 7 October 2016 at 11:28:
>>
>> Martin Schlegel wrote:
>>
>>> Thanks for the confirmation Jan, but this sounds a bit scary to me!
>>>
>>> Spinning this experiment a bit further ...
>>>
>>> Would this not also mean that with a passive RRP with 2 rings it only takes
>>> 2 different nodes that are not able to communicate on different networks at
>>> the same time to have all rings marked faulty on _every_ node ... therefore
>>> all cluster members losing quorum immediately, even though n-2 cluster
>>> members are technically able to send and receive heartbeat messages through
>>> both rings?
>>
>> Not exactly, but this situation causes corosync to start behaving really
>> badly, spending most of its time in the "creating new membership" loop.
>>
>> Yes, RRP is simply bad. If you can, use bonding. Improving RRP by
>> replacing it with knet is the biggest TODO for 3.x.
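
The situation Martin asks about, two different nodes each losing a different ring, can be sketched with a toy model. It only illustrates the whole-ring marking rule discussed in this thread; it is not corosync's implementation, it says nothing about quorum or the membership loop, and the function name `faulty_rings`, the five node names and the data layout are all hypothetical.

```python
def faulty_rings(members, blocked, rings=(0, 1)):
    """Toy whole-ring rule: a ring is faulty if ANY member cannot use it.

    members: set of node names in the current membership (hypothetical)
    blocked: dict mapping node name -> set of ring ids that node cannot use
    """
    return {r for r in rings if any(r in blocked.get(n, set()) for n in members)}


members = {"A", "B", "C", "D", "E"}
# Node A cannot use ring 0, node B cannot use ring 1.
# Each of them can still communicate on its other ring.
blocked = {"A": {0}, "B": {1}}

faulty = faulty_rings(members, blocked)
# Nodes with no blocked ring at all (the "n-2" members).
unaffected = sorted(n for n in members if not blocked.get(n))

print(sorted(faulty))  # [0, 1]
print(unaffected)      # ['C', 'D', 'E']
```

Under this rule both rings end up marked faulty on every member, even though three of the five nodes have no local problem on either ring.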
>>
>> Regards,
>>   Honza
>>
>>> I really hope the answer is no and the cluster still somehow has a quorum in
>>> this case.
>>>
>>> Regards,
>>> Martin Schlegel
>>
>>>> Jan Friesse <jfriesse at redhat.com> wrote on 5 October 2016 at 09:01:
>>>>
>>>> Martin,
>>>>
>>>>> Hello all,
>>>>>
>>>>> I am trying to understand the following 2 Corosync heartbeat ring failure
>>>>> scenarios I have been testing, and hope somebody can explain why this
>>>>> makes any sense.
>>>>>
>>>>> Consider the following cluster:
>>>>>
>>>>> * 3x Nodes: A, B and C
>>>>> * 2x NICs for each Node
>>>>> * Corosync 2.3.5 configured with "rrp_mode: passive" and
>>>>> udpu transport with ring id 0 and 1 on each node.
>>>>> * On each node "corosync-cfgtool -s" shows:
>>>>> [...] ring 0 active with no faults
>>>>> [...] ring 1 active with no faults
>>>>>
>>>>> Consider the following scenarios:
>>>>>
>>>>> 1. On node A only, block all communication on the first NIC, configured
>>>>> with ring id 0
>>>>> 2. On node A only, block all communication on all NICs, configured with
>>>>> ring id 0 and 1
>>>>>
>>>>> The result of the above scenarios is as follows:
>>>>>
>>>>> 1. Nodes A, B and C (!) display the following ring status:
>>>>> [...] Marking ringid 0 interface <IP-Address> FAULTY
>>>>> [...] ring 1 active with no faults
>>>>> 2. Node A is shown as OFFLINE - B and C display the following ring status:
>>>>> [...] ring 0 active with no faults
>>>>> [...] ring 1 active with no faults
>>>>>
>>>>> Questions:
>>>>> 1. Is this the expected outcome?
>>>>
>>>> Yes
>>>>
>>>>> 2. In experiment 1, B and C can still communicate with each other over
>>>>> both NICs, so why are B and C not displaying a "no faults" status for
>>>>> ring id 0 and 1, just like in experiment 2
>>>>
>>>> Because this is how RRP works. RRP marks the whole ring as failed, so
>>>> every node sees that ring as failed.
>>>>
>>>>> when node A is completely unreachable?
>>>>
>>>> Because it's a different scenario. In scenario 1 there is a 3-node
>>>> membership where one node has one failed ring -> the whole ring is
>>>> marked failed. In scenario 2 there is a 2-node membership where both
>>>> rings work as expected. Node A is completely unreachable and is not in
>>>> the membership.
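
The difference between the two scenarios can be sketched as a small toy model. This is only an illustration of the rule described above (a ring is failed for everyone if any current member cannot use it), not corosync's actual code; the function name `ring_status` and the data layout are hypothetical.

```python
def ring_status(members, blocked, rings=(0, 1)):
    """Toy model of the passive RRP rule described in this thread.

    members: set of node names currently in the membership
    blocked: dict mapping node name -> set of ring ids that node cannot use
    Returns a dict ring id -> "FAULTY" or "no faults", as seen by every member.
    """
    status = {}
    for ring in rings:
        # Only nodes that are still members can drag a ring down.
        broken = any(ring in blocked.get(node, set()) for node in members)
        status[ring] = "FAULTY" if broken else "no faults"
    return status


# Scenario 1: A stays a member (ring 1 still works) but has lost ring 0.
scenario_1 = ring_status({"A", "B", "C"}, {"A": {0}})

# Scenario 2: A has lost both rings, so it drops out of the membership.
# Only B and C remain, and neither of them has a blocked ring.
scenario_2 = ring_status({"B", "C"}, {"A": {0, 1}})

print(scenario_1)  # {0: 'FAULTY', 1: 'no faults'}
print(scenario_2)  # {0: 'no faults', 1: 'no faults'}
```

The key point in this sketch is that a node's ring failures only count while that node is still in the membership.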
>>>>
>>>> Regards,
>>>> Honza
>>>>
>>>>> Regards,
>>>>> Martin Schlegel
>>>>>
>>>>> _______________________________________________
>>>>> Users mailing list: Users at clusterlabs.org
>>>>> http://clusterlabs.org/mailman/listinfo/users
>>>>>
>>>>> Project Home: http://www.clusterlabs.org
>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>> Bugs: http://bugs.clusterlabs.org




