[ClusterLabs] redundant ring and corosync makes/sees it as loopback??

Fri Mar 1 02:37:13 EST 2019

lejeczek,

> hi everyone
> 
> My cluster faulted my secondary ring today and on one node I found this:
> 
> Printing ring status.
> Local node ID 3
> RING ID 0
>      id    = 10.5.8.65
>      status    = ring 0 active with no faults
> RING ID 1
>      id    = 127.0.0.1
>      status    = Marking ringid 1 interface 127.0.0.1 FAULTY
> 
> How the hell loopback address got there?

Short version: ifdown

Long version: When an interface is put down, older versions of corosync 
detect this condition and rebind to localhost. Without RRP this is 
usually not a big deal, because the node is (usually) fenced. With RRP 
it is a much bigger problem, because this 127.0.0.1 address is sent to 
all other nodes and completely poisons the whole cluster.
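
To catch this state before it spreads, the ring status can be checked 
for a loopback address. The following is only a minimal sketch, 
assuming the status text has the same layout as the output quoted 
above (as printed by corosync-cfgtool -s); the function name 
loopback_rings and the SAMPLE constant are made up for illustration.

```python
import ipaddress
import re

# Sample taken from the ring status quoted in the original report.
SAMPLE = """\
Printing ring status.
Local node ID 3
RING ID 0
     id    = 10.5.8.65
     status    = ring 0 active with no faults
RING ID 1
     id    = 127.0.0.1
     status    = Marking ringid 1 interface 127.0.0.1 FAULTY
"""


def loopback_rings(status_text):
    """Return the ring IDs whose bound address is a loopback address."""
    bad = []
    ring = None
    for line in status_text.splitlines():
        # A "RING ID n" line starts a new ring section.
        m = re.match(r"\s*RING ID (\d+)", line)
        if m:
            ring = int(m.group(1))
            continue
        # An "id = <address>" line carries the address the ring is bound to.
        m = re.match(r"\s*id\s*=\s*(\S+)", line)
        if m and ring is not None:
            try:
                addr = ipaddress.ip_address(m.group(1))
            except ValueError:
                continue  # not an IP address, skip it
            if addr.is_loopback:
                bad.append(ring)
    return bad


if __name__ == "__main__":
    print(loopback_rings(SAMPLE))
```

In practice the text would come from the real tool output instead of 
SAMPLE, and a non-empty result would be a reason to alert.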

The definitive solution is to use corosync 3 with the knet transport. 
A mostly working solution is to use corosync 3 (or corosync 2 - needle 
branch from git) with the udpu transport.

A simple workaround is to never use ifdown directly. Also, network 
managers quite often react to carrier loss, with the same effect. The 
solution is to either use network scripts or, if a network manager 
must be used, use it with the NetworkManager-config-server package. 
The solution for systemd-networkd should be to use the 
IgnoreCarrierLoss= config option.
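
For systemd-networkd, that option goes into the [Network] section of 
the .network file for the ring interface. A minimal sketch, assuming a 
hypothetical file /etc/systemd/network/10-ring1.network and a 
hypothetical interface name eth1 (neither comes from the original 
report), and a systemd version that supports the option:

```ini
# /etc/systemd/network/10-ring1.network (hypothetical file name)
[Match]
# Hypothetical name of the interface used for the ring
Name=eth1

[Network]
# Keep the configured address even when the carrier is lost
IgnoreCarrierLoss=yes
```

The address configuration for the interface would live in the same 
file and is left out here.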

Regards,
   Honza

> 
> I did: systemctl restart corosync and all went back to "okey"
> 
> many thanks, L.