[ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)

Thu Oct 6 09:35:47 UTC 2016

Thanks for the confirmation, Jan, but this sounds a bit scary to me!

Taking this experiment a bit further:

Would this not also mean that, with rrp_mode passive and 2 rings, it only takes
2 different nodes that are unable to communicate on different networks at the
same time to have all rings marked faulty on _every_ node? All cluster members
would then lose quorum immediately, even though n-2 cluster members are
technically able to send and receive heartbeat messages through both rings.

I really hope the answer is no and the cluster still somehow has a quorum in
this case.
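
To make the scenario concrete, here is a minimal Python sketch of the rule as
Jan described it in the quoted reply below: a ring counts as failed for the
whole membership as soon as one member cannot use it. It is a toy model, not
Corosync code. The function name faulty_rings and the "blocked" mapping are
hypothetical, and the membership is passed in as an input rather than computed.

```python
from typing import Dict, Iterable, Set

RINGS = (0, 1)  # ring ids used in the thread


def faulty_rings(members: Iterable[str], blocked: Dict[str, Set[int]]) -> Set[int]:
    """Return the ring ids that every member would report as FAULTY.

    Toy rule taken from the quoted reply: a ring is failed for the whole
    membership as soon as one member cannot use it. Nodes outside the
    membership are ignored.
    """
    members = list(members)
    return {
        ring
        for ring in RINGS
        if any(ring in blocked.get(node, set()) for node in members)
    }


# Scenario 1: A loses ring 0 but stays in the 3-node membership.
scenario_1 = faulty_rings({"A", "B", "C"}, {"A": {0}})

# Scenario 2: A loses both rings and drops out; the membership is B and C.
scenario_2 = faulty_rings({"B", "C"}, {"A": {0, 1}})

# Hypothetical: A loses ring 0, B loses ring 1, and (assumption) all
# three nodes are still counted as members.
hypothetical = faulty_rings({"A", "B", "C"}, {"A": {0}, "B": {1}})

print(scenario_1, scenario_2, hypothetical)
```

Under this rule the hypothetical case marks both rings faulty on every node,
which is exactly the situation I am asking about. The sketch says nothing about
what happens to membership or quorum afterwards.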

Regards,
Martin Schlegel

> On 5 October 2016 at 09:01, Jan Friesse <jfriesse at redhat.com> wrote:
> 
> Martin,
> 
> > Hello all,
> > 
> > I have been testing the following 2 Corosync heartbeat ring failure
> > scenarios and hope somebody can explain why the results make sense.
> > 
> > Consider the following cluster:
> > 
> >  * 3x Nodes: A, B and C
> >  * 2x NICs for each Node
> >  * Corosync 2.3.5 configured with "rrp_mode: passive" and
> >  udpu transport with ring id 0 and 1 on each node.
> >  * On each node "corosync-cfgtool -s" shows:
> >  [...] ring 0 active with no faults
> >  [...] ring 1 active with no faults
> > 
> > Consider the following scenarios:
> > 
> >  1. On node A only, block all communication on the first NIC, configured
> > with ring id 0
> >  2. On node A only, block all communication on all NICs, configured with
> > ring id 0 and 1
> > 
> > The result of the above scenarios is as follows:
> > 
> >  1. Nodes A, B and C (!) display the following ring status:
> >  [...] Marking ringid 0 interface <IP-Address> FAULTY
> >  [...] ring 1 active with no faults
> >  2. Node A is shown as OFFLINE - B and C display the following ring status:
> >  [...] ring 0 active with no faults
> >  [...] ring 1 active with no faults
> > 
> > Questions:
> >  1. Is this the expected outcome ?
> 
> Yes
> 
> > 2. In experiment 1, B and C can still communicate with each other over both
> > NICs, so why are B and C not displaying a "no faults" status for ring id 0
> > and 1, just like in experiment 2.
> 
> Because this is how RRP works. RRP marks the whole ring as failed, so every
> node sees that ring as failed.
> 
> > when node A is completely unreachable ?
> 
> Because it's a different scenario. In scenario 1 there is a 3-node
> membership where one of the nodes has one failed ring -> the whole ring is
> failed. In scenario 2 there is a 2-node membership where both rings
> work as expected. Node A is completely unreachable and it's not in the
> membership.
> 
> Regards,
>  Honza
> 
> > Regards,
> > Martin Schlegel