[ClusterLabs] Antw: Re: Corosync with passive rrp, udpu - Unable to reset after "Marking ringid 1 interface 127.0.0.1 FAULTY"
Martin Schlegel
martin at nuboreto.org
Tue Jul 19 12:27:20 UTC 2016
> Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de> wrote on 19 July 2016 at
> 08:41:
>
> >>> Martin Schlegel <martin at nuboreto.org> wrote on 19.07.2016 at 00:51 in
> message
> <301244266.332724.5ea3ddc5-55ea-43b0-9a1b-22ebb1dcafd2.open-xchange at email.1und1.
> e>:
>
> > Thanks Jan !
> >
> > If anybody else hits the error of a ring being bound to 127.0.0.1 instead
> > of the configured IP, with corosync-cfgtool -s showing "[...] interface
> > 127.0.0.1 FAULTY [...]":
> >
> > We noticed an issue that occasionally occurs at boot time, which we believe
> > to be a bug in Ubuntu 14.04: it causes Corosync to start before all
> > bindnetaddr IPs are up and running.
>
> Would it also happen if someone does a "rcnetwork restart" while the cluster
> is up? I think we had it once in SLES11 as well, but I was never sure how it
> was triggered.
Yes, it likely would, as that would do an ifdown on the network interfaces.
However, as we could confirm in syslog, our issues always originated at boot
time and only affected us later, when the other, remaining ring failed as well.
We are adding monitoring to detect any ring being marked faulty. That way we
are warned and can try to recover the ring manually in case the automatic
recovery has given up.
I wish the health state of each ring on each node were available cluster-wide,
but I have not found out how. Instead I need to gather the output of
"corosync-cfgtool -s" on each node.
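Gathering that output per node could be sketched roughly like this (a minimal
sketch, not our actual monitoring; it assumes password-less ssh to each node,
and the node names passed on the command line are placeholders):

```shell
#!/bin/sh
# Report every ring that corosync-cfgtool marks FAULTY, across a list of nodes.

faulty_rings() {
    # Read `corosync-cfgtool -s` output on stdin and print the id of every
    # ring whose status line contains FAULTY.
    awk '/^RING ID/ { ring = $3 }
         /FAULTY/   { print ring }'
}

# Usage: ./check_rings.sh pg1 pg2 pg3
for node in "$@"; do
    ssh "$node" corosync-cfgtool -s | faulty_rings | while read -r ring; do
        echo "WARNING: $node reports ring $ring FAULTY"
    done
done
```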
>
> > What happens is that, despite the $network dependency and the correct
> > ordering of the corosync runlevel script, the corosync service might be
> > started after only the bond0 interface was fully up, but before our bond1
> > interface was assigned its IP address.
> >
> > For now we have added some code to the Corosync runlevel script that waits
> > for a certain amount of time for whatever bindnetaddr IPs have been
> > configured in /etc/corosync/corosync.conf .
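The wait described above could be sketched along these lines (a hedged sketch,
not the actual patch from the mail; it assumes the stock config path and that
every ring network uses a /24 netmask, since the prefix comparison relies on
that):

```shell
#!/bin/sh
# Wait until every configured bindnetaddr network has a local address,
# intended to run from the runlevel script before starting corosync.

CONF=/etc/corosync/corosync.conf
TIMEOUT=60   # seconds to wait per ring before giving up

bindnets() {
    # Print every bindnetaddr value from the corosync configuration.
    awk '$1 == "bindnetaddr:" { print $2 }' "$CONF"
}

wait_for_net() {
    # Block until some local IPv4 address falls into the /24 of the given
    # network address, or until TIMEOUT expires.
    prefix=${1%.*}        # e.g. "10.0.1.0" -> "10.0.1"
    i=0
    while [ "$i" -lt "$TIMEOUT" ]; do
        ip -o -4 addr show | grep -q " inet $prefix\." && return 0
        sleep 1
        i=$((i + 1))
    done
    echo "timed out waiting for an address in $prefix.0/24" >&2
    return 1
}

# Invoke with --wait to actually block (kept behind a flag so the functions
# can be sourced without side effects).
if [ "${1:-}" = "--wait" ]; then
    for net in $(bindnets); do
        wait_for_net "$net" || exit 1
    done
fi
```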
> >
> > Cheers,
> > Martin Schlegel
>
> >> Jan Friesse <jfriesse at redhat.com> wrote on 16 June 2016 at 17:55:
> >>
> >> Martin Schlegel wrote:
> >>
> >> > Hello everyone,
> >> >
> >> > we have been running a 3-node Pacemaker (1.1.14) / Corosync (2.3.5)
> >> > cluster successfully for a couple of months, and we have started seeing
> >> > a faulty ring with an unexpected 127.0.0.1 binding that we cannot reset
> >> > via "corosync-cfgtool -r".
> >> This is the problem. A bind to 127.0.0.1 means an ifdown happened, which
> >> is a problem, and with RRP it means a BIG problem.
> >>
> >> > We have had this once before and only restarting Corosync (and everything
> >> > else)
> >> > on the node showing the unexpected 127.0.0.1 binding made the problem go
> >> > away.
> >> > However, in production we obviously would like to avoid this if possible.
> >>
> >> Just don't do ifdown. Never. If you are using NetworkManager (which does
> >> an ifdown by default when the cable is disconnected), use something like
> >> the NetworkManager-config-server package (it's just a change of
> >> configuration, so you can adapt it to whatever distribution you are
> >> using).
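For reference, that package essentially ships a NetworkManager drop-in along
the lines of the following (the path and exact file name are an assumption
based on common packaging; check your distribution's package contents):

```ini
; /etc/NetworkManager/conf.d/00-server.conf (sketch; the packaged file may differ)
[main]
; Do not create automatic connections for new devices.
no-auto-default=*
; Keep devices configured even when the cable is unplugged,
; so no ifdown is performed on carrier loss.
ignore-carrier=*
```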
> >>
> >> Regards,
> >> Honza
> >>
> >> > So, from the following description - how can I troubleshoot this issue,
> >> > and/or does anybody have a good idea what might be happening here?
> >> >
> >> > We run two passive RRP rings across different IP subnets via udpu and
> >> > get the following output (all IPs obfuscated) - please note the
> >> > unexpected interface binding 127.0.0.1 for host pg2.
> >> >
> >> > If we reset via "corosync-cfgtool -r" on each node, heartbeat ring id 1
> >> > briefly shows "no faults" but goes back to "FAULTY" seconds later.
> >> >
> >> > Regards,
> >> > Martin Schlegel
> >> > _____________________________________
> >> >
> >> > root at pg1:~# corosync-cfgtool -s
> >> > Printing ring status.
> >> > Local node ID 1
> >> > RING ID 0
> >> > id = A.B.C1.5
> >> > status = ring 0 active with no faults
> >> > RING ID 1
> >> > id = D.E.F1.170
> >> > status = Marking ringid 1 interface D.E.F1.170 FAULTY
> >> >
> >> > root at pg2:~# corosync-cfgtool -s
> >> > Printing ring status.
> >> > Local node ID 2
> >> > RING ID 0
> >> > id = A.B.C2.88
> >> > status = ring 0 active with no faults
> >> > RING ID 1
> >> > id = 127.0.0.1
> >> > status = Marking ringid 1 interface 127.0.0.1 FAULTY
> >> >
> >> > root at pg3:~# corosync-cfgtool -s
> >> > Printing ring status.
> >> > Local node ID 3
> >> > RING ID 0
> >> > id = A.B.C3.236
> >> > status = ring 0 active with no faults
> >> > RING ID 1
> >> > id = D.E.F3.112
> >> > status = Marking ringid 1 interface D.E.F3.112 FAULTY
> >> >
> >> > _____________________________________
> >> >
> >> > /etc/corosync/corosync.conf from pg1 - other nodes use different
> >> > subnets and IPs, but are otherwise identical:
> >> >
> >> > ===========================================
> >> > quorum {
> >> >     provider: corosync_votequorum
> >> >     expected_votes: 3
> >> > }
> >> >
> >> > totem {
> >> >     version: 2
> >> >
> >> >     crypto_cipher: none
> >> >     crypto_hash: none
> >> >
> >> >     rrp_mode: passive
> >> >     interface {
> >> >         ringnumber: 0
> >> >         bindnetaddr: A.B.C1.0
> >> >         mcastport: 5405
> >> >         ttl: 1
> >> >     }
> >> >     interface {
> >> >         ringnumber: 1
> >> >         bindnetaddr: D.E.F1.64
> >> >         mcastport: 5405
> >> >         ttl: 1
> >> >     }
> >> >     transport: udpu
> >> > }
> >> >
> >> > nodelist {
> >> >     node {
> >> >         ring0_addr: pg1
> >> >         ring1_addr: pg1p
> >> >         nodeid: 1
> >> >     }
> >> >     node {
> >> >         ring0_addr: pg2
> >> >         ring1_addr: pg2p
> >> >         nodeid: 2
> >> >     }
> >> >     node {
> >> >         ring0_addr: pg3
> >> >         ring1_addr: pg3p
> >> >         nodeid: 3
> >> >     }
> >> > }
> >> >
> >> > logging {
> >> >     to_syslog: yes
> >> > }
> >> >
> >> > ===========================================
> >> >
> >> > _______________________________________________
> >> > Users mailing list: Users at clusterlabs.org
> >> > http://clusterlabs.org/mailman/listinfo/users
> >> >
> >> > Project Home: http://www.clusterlabs.org
> >> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >> > Bugs: http://bugs.clusterlabs.org
> >>
> >> >