[ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)

Digimer lists at alteeve.ca
Thu Oct 6 14:16:27 UTC 2016


On 06/10/16 05:38 AM, Martin Schlegel wrote:
> Thanks for the confirmation, Jan, but this sounds a bit scary to me!
> 
> Spinning this experiment a bit further ...
> 
> Would this not also mean that with passive rrp and 2 rings, it only takes 2
> different nodes that are unable to communicate on different networks at the
> same time to have all rings marked faulty on _every_ node, so that all
> cluster members lose quorum immediately, even though n-2 cluster members are
> technically able to send and receive heartbeat messages through both rings?
> 
> I really hope the answer is no and the cluster still somehow retains quorum
> in this case.
> 
> Regards,
> Martin Schlegel
> 
> 
>> Jan Friesse <jfriesse at redhat.com> wrote on 5 October 2016 at 09:01:
>>
>> Martin,
>>
>>> Hello all,
>>>
>>> I am trying to understand the results of the following 2 Corosync
>>> heartbeat ring failure scenarios I have been testing, and I hope
>>> somebody can explain why they make sense.
>>>
>>> Consider the following cluster:
>>>
>>>  * 3x Nodes: A, B and C
>>>  * 2x NICs for each Node
>>>  * Corosync 2.3.5 configured with "rrp_mode: passive" and udpu
>>>  transport, with ring ids 0 and 1 on each node (see the
>>>  configuration sketch after this list).
>>>  * On each node "corosync-cfgtool -s" shows:
>>>  [...] ring 0 active with no faults
>>>  [...] ring 1 active with no faults
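
For reference, a minimal corosync.conf matching the setup described above
might look like this (addresses, node ids and ports are illustrative
assumptions, not values from the original post):

  totem {
      version: 2
      transport: udpu
      rrp_mode: passive
      interface {
          ringnumber: 0
          bindnetaddr: 192.168.0.0
          mcastport: 5405
      }
      interface {
          ringnumber: 1
          bindnetaddr: 192.168.1.0
          mcastport: 5407
      }
  }
  nodelist {
      node {
          nodeid: 1
          ring0_addr: 192.168.0.1
          ring1_addr: 192.168.1.1
      }
      # nodes B and C follow the same pattern with their own addresses
  }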
>>>
>>> Consider the following scenarios:
>>>
>>>  1. On node A only, block all communication on the first NIC,
>>>  configured with ring id 0
>>>  2. On node A only, block all communication on both NICs, configured
>>>  with ring ids 0 and 1
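
One way to reproduce the blocking step on node A, assuming the ring-0 NIC
is eth0 and the ring-1 NIC is eth1 (interface names are assumptions):

  # Scenario 1: block all traffic on the ring 0 NIC only
  iptables -A INPUT  -i eth0 -j DROP
  iptables -A OUTPUT -o eth0 -j DROP

  # Scenario 2: additionally block the ring 1 NIC
  iptables -A INPUT  -i eth1 -j DROP
  iptables -A OUTPUT -o eth1 -j DROP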
>>>
>>> The result of the above scenarios is as follows:
>>>
>>>  1. Nodes A, B and C (!) display the following ring status:
>>>  [...] Marking ringid 0 interface <IP-Address> FAULTY
>>>  [...] ring 1 active with no faults
>>>  2. Node A is shown as OFFLINE - B and C display the following ring status:
>>>  [...] ring 0 active with no faults
>>>  [...] ring 1 active with no faults
>>>
>>> Questions:
>>>  1. Is this the expected outcome?
>>
>> Yes
>>
>>> 2. In experiment 1, B and C can still communicate with each other over
>>> both NICs, so why are B and C not displaying a "no faults" status for
>>> ring ids 0 and 1, just like in experiment 2,
>>
>> Because this is how RRP works. RRP marks the whole ring as failed, so
>> every node sees that ring as failed.
>>
>>> when node A is completely unreachable?
>>
>> Because it's a different scenario. In scenario 1 there is a 3-node
>> membership where one of the nodes has one failed ring -> the whole ring
>> is marked failed. In scenario 2 there is a 2-node membership where both
>> rings work as expected. Node A is completely unreachable, so it's not in
>> the membership.
>>
>> Regards,
>>  Honza

Have you considered using active/passive bonded interfaces? If you did,
you would be able to fail links in any order on the nodes and corosync
would not know the difference.
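
A minimal sketch of such an active-backup bond with iproute2, assuming the
two NICs are eth0 and eth1 and an illustrative address (corosync would then
be configured with a single ring on bond0):

  # create an active/passive (active-backup) bond and enslave both NICs
  ip link add bond0 type bond mode active-backup miimon 100
  ip link set eth0 down
  ip link set eth1 down
  ip link set eth0 master bond0
  ip link set eth1 master bond0
  ip addr add 192.168.0.1/24 dev bond0
  ip link set bond0 up

Link failover is then handled below corosync, so a single NIC failure never
marks a ring faulty.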

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



