[ClusterLabs] Antw: Re: Corosync with passive rrp, udpu - Unable to reset after "Marking ringid 1 interface 127.0.0.1 FAULTY"
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Tue Jul 19 06:41:15 UTC 2016
>>> Martin Schlegel <martin at nuboreto.org> wrote on 19.07.2016 at 00:51 in message
<301244266.332724.5ea3ddc5-55ea-43b0-9a1b-22ebb1dcafd2.open-xchange at email.1und1.e>:
> Thanks, Jan!
>
> If anybody else is hitting the error of a ring being bound to 127.0.0.1
> instead of the configured IP, with corosync-cfgtool -s showing "[...]
> interface 127.0.0.1 FAULTY [...]":
>
> We noticed an issue that occasionally occurs at boot time and that we
> believe to be a bug in Ubuntu 14.04: it causes Corosync to start before
> all bindnetaddr IPs are up and running.
Would it also happen if someone does a "rcnetwork restart" while the cluster is up? I think we had it once on SLES11 as well, but I was never sure how it was triggered.
>
> What happens is that, despite the $network dependency and the correct
> ordering of the corosync runlevel script, the corosync service might be
> started after only the bond0 interface was fully up, but before our bond1
> interface was assigned its IP address.
>
> For now we have added some code to the Corosync runlevel script that waits
> a certain amount of time for whatever bindnetaddr IPs have been configured
> in /etc/corosync/corosync.conf; a sketch follows below.
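>
> A minimal sketch of the wait loop (the awk extraction, the naive prefix
> match on the first three octets and the 60 s timeout are simplifications
> for illustration, not the exact production code):
>
>     #!/bin/sh
>     # Wait until every bindnetaddr network from corosync.conf has a
>     # matching local IPv4 address, or give up after ~60 seconds.
>     conf=/etc/corosync/corosync.conf
>     timeout=60
>     for net in $(awk '$1 == "bindnetaddr:" { print $2 }' "$conf"); do
>         prefix=${net%.*}            # e.g. A.B.C1.0 -> A.B.C1
>         until ip -o -4 addr show | grep -q "inet $prefix\."; do
>             timeout=$((timeout - 1))
>             [ "$timeout" -le 0 ] && { echo "timed out waiting for $net" >&2; exit 1; }
>             sleep 1
>         done
>     done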
>
> Cheers,
> Martin Schlegel
>
>
>> Jan Friesse <jfriesse at redhat.com> wrote on 16 June 2016 at 17:55:
>>
>> Martin Schlegel wrote:
>>
>> > Hello everyone,
>> >
>> > we have been running a 3-node Pacemaker (1.1.14) / Corosync (2.3.5)
>> > cluster successfully for a couple of months, and we have started seeing
>> > a faulty ring with an unexpected 127.0.0.1 binding that we cannot reset
>> > via "corosync-cfgtool -r".
>>
>> This is the problem. A bind to 127.0.0.1 means an ifdown happened, which
>> is a problem on its own, and with RRP it is a BIG problem.
>>
>> > We have had this once before, and only restarting Corosync (and
>> > everything else) on the node showing the unexpected 127.0.0.1 binding
>> > made the problem go away. However, in production we would obviously like
>> > to avoid this if possible.
>>
>> Just don't do ifdown. Never. If you are using NetworkManager (which does
>> an ifdown by default when the cable is disconnected), use something like
>> the NetworkManager-config-server package (it is just a configuration
>> change, so you can adapt it to whatever distribution you are using).
>>
>> Regards,
>> Honza
>>
>> > So from the following description - how can I troubleshoot this issue,
>> > and/or does anybody have a good idea what might be happening here?
>> >
>> > We run 2x passive rrp rings across different IP subnets via udpu, and
>> > we get the following output (all IPs obfuscated) - please notice the
>> > unexpected interface binding 127.0.0.1 for host pg2.
>> >
>> > If we reset via "corosync-cfgtool -r" on each node, heartbeat ring id 1
>> > briefly shows "no faults" but goes back to "FAULTY" seconds later; the
>> > cycle is sketched below.
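>> >
>> > The cycle looks like this on each node (the 10 s pause is arbitrary):
>> >
>> >     corosync-cfgtool -r   # re-enable the redundant ring / clear FAULTY
>> >     sleep 10
>> >     corosync-cfgtool -s   # ring 1 is back to FAULTY by now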
>> >
>> > Regards,
>> > Martin Schlegel
>> > _____________________________________
>> >
>> > root at pg1:~# corosync-cfgtool -s
>> > Printing ring status.
>> > Local node ID 1
>> > RING ID 0
>> >         id      = A.B.C1.5
>> >         status  = ring 0 active with no faults
>> > RING ID 1
>> >         id      = D.E.F1.170
>> >         status  = Marking ringid 1 interface D.E.F1.170 FAULTY
>> >
>> > root at pg2:~# corosync-cfgtool -s
>> > Printing ring status.
>> > Local node ID 2
>> > RING ID 0
>> >         id      = A.B.C2.88
>> >         status  = ring 0 active with no faults
>> > RING ID 1
>> >         id      = 127.0.0.1
>> >         status  = Marking ringid 1 interface 127.0.0.1 FAULTY
>> >
>> > root at pg3:~# corosync-cfgtool -s
>> > Printing ring status.
>> > Local node ID 3
>> > RING ID 0
>> >         id      = A.B.C3.236
>> >         status  = ring 0 active with no faults
>> > RING ID 1
>> >         id      = D.E.F3.112
>> >         status  = Marking ringid 1 interface D.E.F3.112 FAULTY
>> >
>> > _____________________________________
>> >
>> > /etc/corosync/corosync.conf from pg1 (other nodes use different subnets
>> > and IPs, but are otherwise identical):
>> > ===========================================
>> > quorum {
>> >     provider: corosync_votequorum
>> >     expected_votes: 3
>> > }
>> >
>> > totem {
>> >     version: 2
>> >
>> >     crypto_cipher: none
>> >     crypto_hash: none
>> >
>> >     rrp_mode: passive
>> >     interface {
>> >         ringnumber: 0
>> >         bindnetaddr: A.B.C1.0
>> >         mcastport: 5405
>> >         ttl: 1
>> >     }
>> >     interface {
>> >         ringnumber: 1
>> >         bindnetaddr: D.E.F1.64
>> >         mcastport: 5405
>> >         ttl: 1
>> >     }
>> >     transport: udpu
>> > }
>> >
>> > nodelist {
>> >     node {
>> >         ring0_addr: pg1
>> >         ring1_addr: pg1p
>> >         nodeid: 1
>> >     }
>> >     node {
>> >         ring0_addr: pg2
>> >         ring1_addr: pg2p
>> >         nodeid: 2
>> >     }
>> >     node {
>> >         ring0_addr: pg3
>> >         ring1_addr: pg3p
>> >         nodeid: 3
>> >     }
>> > }
>> >
>> > logging {
>> >     to_syslog: yes
>> > }
>> >
>> > ===========================================
>> >