[ClusterLabs] Antw: Re: Corosync with passive rrp, udpu - Unable to reset after "Marking ringid 1 interface 127.0.0.1 FAULTY"
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Tue Jul 19 06:41:15 UTC 2016
>>> Martin Schlegel <martin at nuboreto.org> wrote on 19.07.2016 at 00:51 in message
<301244266.332724.5ea3ddc5-55ea-43b0-9a1b-22ebb1dcafd2.open-xchange at email.1und1.e>:
> Thanks, Jan!
>
> If anybody else is hitting the error of a ring being bound to 127.0.0.1
> instead of the configured IP, with corosync-cfgtool -s showing "[...]
> interface 127.0.0.1 FAULTY [...]":
>
> We noticed an issue that occasionally occurs at boot time and that we
> believe to be a bug in Ubuntu 14.04: it causes Corosync to start before
> all bindnetaddr IPs are up and running.
Would it also happen if someone does a "rcnetwork restart" while the cluster is up? I think we had it once on SLES11 as well, but I was never sure how it was triggered.
>
> What happens is that, despite the $network dependency and the correct
> ordering of the corosync runlevel script, the corosync service might be
> started after only the bond0 interface was fully up, but before our bond1
> interface was assigned its IP address.
>
> For now we have added some code to the Corosync runlevel script that waits
> a certain amount of time for whatever bindnetaddr IPs have been configured
> in /etc/corosync/corosync.conf; a sketch follows below.
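>
> A minimal sketch of the wait loop (the awk extraction, the naive prefix
> match on the first three octets and the 60 s timeout are simplifications
> for illustration, not the exact production code):
>
>     #!/bin/sh
>     # Wait until every bindnetaddr network from corosync.conf has a
>     # matching local IPv4 address, or give up after ~60 seconds.
>     conf=/etc/corosync/corosync.conf
>     timeout=60
>     for net in $(awk '$1 == "bindnetaddr:" { print $2 }' "$conf"); do
>         prefix=${net%.*}            # e.g. A.B.C1.0 -> A.B.C1
>         until ip -o -4 addr show | grep -q "inet $prefix\."; do
>             timeout=$((timeout - 1))
>             [ "$timeout" -le 0 ] && { echo "timed out waiting for $net" >&2; exit 1; }
>             sleep 1
>         done
>     done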
>
> Cheers,
> Martin Schlegel
>
>
>> Jan Friesse <jfriesse at redhat.com> wrote on 16 June 2016 at 17:55:
>>
>> Martin Schlegel wrote:
>>
>> > Hello everyone,
>> >
>> > we have been running a 3-node Pacemaker (1.1.14) / Corosync (2.3.5)
>> > cluster successfully for a couple of months, and we have started seeing
>> > a faulty ring with an unexpected 127.0.0.1 binding that we cannot reset
>> > via "corosync-cfgtool -r".
>>
>> This is the problem. A bind to 127.0.0.1 means an ifdown happened, which
>> is a problem on its own, and with RRP it is a BIG problem.
>>
>> > We have had this once before, and only restarting Corosync (and
>> > everything else) on the node showing the unexpected 127.0.0.1 binding
>> > made the problem go away. However, in production we would obviously like
>> > to avoid this if possible.
>>
>> Just don't do ifdown. Never. If you are using NetworkManager (which does
>> an ifdown by default when the cable is disconnected), use something like
>> the NetworkManager-config-server package (it is just a configuration
>> change, so you can adapt it to whatever distribution you are using).
>>
>> Regards,
>> Honza
>>
>> > So from the following description - how can I troubleshoot this issue,
>> > and/or does anybody have a good idea what might be happening here?
>> >
>> > We run 2x passive rrp rings across different IP subnets via udpu, and
>> > we get the following output (all IPs obfuscated) - please notice the
>> > unexpected interface binding 127.0.0.1 for host pg2.
>> >
>> > If we reset via "corosync-cfgtool -r" on each node, heartbeat ring id 1
>> > briefly shows "no faults" but goes back to "FAULTY" seconds later; the
>> > cycle is sketched below.
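>> >
>> > The cycle looks like this on each node (the 10 s pause is arbitrary):
>> >
>> >     corosync-cfgtool -r   # re-enable the redundant ring / clear FAULTY
>> >     sleep 10
>> >     corosync-cfgtool -s   # ring 1 is back to FAULTY by now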
>> >
>> > Regards,
>> > Martin Schlegel
>> > _____________________________________
>> >
>> > root at pg1:~# corosync-cfgtool -s
>> > Printing ring status.
>> > Local node ID 1
>> > RING ID 0
>> >         id      = A.B.C1.5
>> >         status  = ring 0 active with no faults
>> > RING ID 1
>> >         id      = D.E.F1.170
>> >         status  = Marking ringid 1 interface D.E.F1.170 FAULTY
>> >
>> > root at pg2:~# corosync-cfgtool -s
>> > Printing ring status.
>> > Local node ID 2
>> > RING ID 0
>> >         id      = A.B.C2.88
>> >         status  = ring 0 active with no faults
>> > RING ID 1
>> >         id      = 127.0.0.1
>> >         status  = Marking ringid 1 interface 127.0.0.1 FAULTY
>> >
>> > root at pg3:~# corosync-cfgtool -s
>> > Printing ring status.
>> > Local node ID 3
>> > RING ID 0
>> >         id      = A.B.C3.236
>> >         status  = ring 0 active with no faults
>> > RING ID 1
>> >         id      = D.E.F3.112
>> >         status  = Marking ringid 1 interface D.E.F3.112 FAULTY
>> >
>> > _____________________________________
>> >
>> > /etc/corosync/corosync.conf from pg1 (other nodes use different subnets
>> > and IPs, but are otherwise identical):
>> > ===========================================
>> > quorum {
>> >     provider: corosync_votequorum
>> >     expected_votes: 3
>> > }
>> >
>> > totem {
>> >     version: 2
>> >
>> >     crypto_cipher: none
>> >     crypto_hash: none
>> >
>> >     rrp_mode: passive
>> >     interface {
>> >         ringnumber: 0
>> >         bindnetaddr: A.B.C1.0
>> >         mcastport: 5405
>> >         ttl: 1
>> >     }
>> >     interface {
>> >         ringnumber: 1
>> >         bindnetaddr: D.E.F1.64
>> >         mcastport: 5405
>> >         ttl: 1
>> >     }
>> >     transport: udpu
>> > }
>> >
>> > nodelist {
>> >     node {
>> >         ring0_addr: pg1
>> >         ring1_addr: pg1p
>> >         nodeid: 1
>> >     }
>> >     node {
>> >         ring0_addr: pg2
>> >         ring1_addr: pg2p
>> >         nodeid: 2
>> >     }
>> >     node {
>> >         ring0_addr: pg3
>> >         ring1_addr: pg3p
>> >         nodeid: 3
>> >     }
>> > }
>> >
>> > logging {
>> >     to_syslog: yes
>> > }
>> >
>> > ===========================================
>> >