[ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)

Martin Schlegel martin at nuboreto.org
Fri Oct 7 15:08:46 UTC 2016


Thanks for all the responses from Jan, Ulrich and Digimer!

We are already using bonded network interfaces, but we are also forced to go
across IP subnets, and certain routes between those subnets can go missing and
in fact have gone missing.
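
For what it's worth, each heartbeat interface on our side is a bonded pair
roughly along these lines (a simplified sketch in Debian ifupdown syntax - the
NIC names, addresses and active-backup mode are placeholders, not our exact
setup):

    auto bond0
    iface bond0 inet static
        address 10.10.0.11          # placeholder address
        netmask 255.255.255.0
        bond-slaves eth0 eth1       # placeholder NIC names
        bond-mode active-backup     # one NIC carries traffic, the other is standby
        bond-miimon 100             # link monitoring interval in ms

So NIC and link failures are already covered by the bond; what it cannot cover
is a route between IP subnets disappearing.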

This has already happened to one of our nodes on its public network, which
became unreachable from other local, public IP subnets. If this were to happen
in parallel on another node's private network, the entire cluster would be
down, just because - as Ulrich said, "It's a ring !" - both heartbeat rings
would be marked faulty. That is not an optimal result, because cluster
communication would in fact still be 100% possible between all nodes.
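
For context, the configuration in question looks roughly like this (a
simplified sketch - the subnet and node addresses are placeholders, not our
real values):

    totem {
        version: 2
        rrp_mode: passive
        transport: udpu

        interface {
            ringnumber: 0
            bindnetaddr: 192.0.2.0        # public subnet (placeholder)
            mcastport: 5405
        }
        interface {
            ringnumber: 1
            bindnetaddr: 198.51.100.0     # private subnet (placeholder)
            mcastport: 5405
        }
    }

    nodelist {
        node {
            nodeid: 1
            ring0_addr: 192.0.2.11
            ring1_addr: 198.51.100.11
        }
        # ... nodes B and C accordingly ...
    }

As discussed below, a single node losing reachability on ring 0 is enough for
every node to log "Marking ringid 0 interface <IP-Address> FAULTY".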

With an increasing number of nodes this risk becomes fairly large - just think
of providers of larger cloud infrastructures.

With the above scenario in mind - is there a better (tested and recommended)
way to configure this?
... or is knet the way to go in the future?
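
If knet is the answer, I imagine - untested, and purely from what I have read
about the corosync 3.x / kronosnet plans so far - a two-link configuration
would look something like the sketch below, with knet tracking each link per
node instead of marking a whole ring faulty:

    totem {
        version: 2
        transport: knet

        interface {
            linknumber: 0
            knet_link_priority: 1         # per-link priority (details are my assumption)
        }
        interface {
            linknumber: 1
            knet_link_priority: 2
        }
    }

    nodelist {
        node {
            nodeid: 1
            name: node-a                  # placeholder name
            ring0_addr: 192.0.2.11        # placeholder addresses
            ring1_addr: 198.51.100.11
        }
        # ... nodes B and C accordingly ...
    }

Please correct me if that is not where knet is heading.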


Regards,
Martin Schlegel


> Jan Friesse <jfriesse at redhat.com> wrote on 7 October 2016 at 11:28:
> 
> Martin Schlegel wrote:
> 
> > Thanks for the confirmation Jan, but this sounds a bit scary to me !
> > 
> > Spinning this experiment a bit further ...
> > 
> > Would this not also mean that with a passive rrp with 2 rings, it only takes
> > 2 different nodes that are unable to communicate on different networks at
> > the same time to have all rings marked faulty on _every_ node ... and
> > therefore all cluster members lose quorum immediately, even though n-2
> > cluster members are technically able to send and receive heartbeat messages
> > through both rings ?
> 
> Not exactly, but this situation causes corosync to start behaving really 
> badly, spending most of the time in a "creating new membership" loop.
> 
> Yes, RRP is simply bad. If you can, use bonding. Improving RRP by replacing 
> it with knet is the biggest TODO for 3.x.
> 
> Regards,
>  Honza
> 
> > I really hope the answer is no and the cluster still somehow has a quorum in
> > this case.
> > 
> > Regards,
> > Martin Schlegel
> 
> >> Jan Friesse <jfriesse at redhat.com> wrote on 5 October 2016 at 09:01:
> >>
> >> Martin,
> >>
> >>> Hello all,
> >>>
> >>> I am trying to understand the following 2 Corosync heartbeat ring failure
> >>> scenarios I have been testing, and I hope somebody can explain why this
> >>> behaviour makes sense.
> >>>
> >>> Consider the following cluster:
> >>>
> >>> * 3x Nodes: A, B and C
> >>> * 2x NICs for each Node
> >>> * Corosync 2.3.5 configured with "rrp_mode: passive" and
> >>> udpu transport with ring id 0 and 1 on each node.
> >>> * On each node "corosync-cfgtool -s" shows:
> >>> [...] ring 0 active with no faults
> >>> [...] ring 1 active with no faults
> >>>
> >>> Consider the following scenarios:
> >>>
> >>> 1. On node A only, block all communication on the first NIC (configured
> >>> with ring id 0)
> >>> 2. On node A only, block all communication on all NICs (configured with
> >>> ring id 0 and 1)
> >>>
> >>> The result of the above scenarios is as follows:
> >>>
> >>> 1. Nodes A, B and C (!) display the following ring status:
> >>> [...] Marking ringid 0 interface <IP-Address> FAULTY
> >>> [...] ring 1 active with no faults
> >>> 2. Node A is shown as OFFLINE - B and C display the following ring status:
> >>> [...] ring 0 active with no faults
> >>> [...] ring 1 active with no faults
> >>>
> >>> Questions:
> >>> 1. Is this the expected outcome ?
> >>
> >> Yes
> >>
> >>> 2. In experiment 1, B and C can still communicate with each other over
> >>> both NICs, so why are B and C not displaying a "no faults" status for
> >>> ring id 0 and 1, just like in experiment 2,
> >>
> >> Because this is how RRP works. RRP marks the whole ring as failed, so every
> >> node sees that ring as failed.
> >>
> >>> when node A is completely unreachable ?
> >>
> >> Because it's a different scenario. In scenario 1 there is a 3-node
> >> membership where one of the nodes has one failed ring -> the whole ring is
> >> failed. In scenario 2 there is a 2-node membership where both rings work
> >> as expected. Node A is completely unreachable and is not in the
> >> membership.
> >>
> >> Regards,
> >> Honza
> >>
> >>> Regards,
> >>> Martin Schlegel
> >>>
> 



