[ClusterLabs] Corosync with passive rrp, udpu - Unable to reset after "Marking ringid 1 interface 127.0.0.1 FAULTY"

Jan Friesse jfriesse at redhat.com
Fri Jun 17 04:57:51 EDT 2016


Martin,


> Hi Jan
>
> Thanks for your super quick response !
>
> We do not use a Network Manager - it's all static on these Ubuntu 14.04 nodes
> (/etc/network/interfaces).

Good

>
> I do not think we did an ifdown on the network interface manually. However, the
> IP-Addresses are assigned to bond0 and bond1 - we use 4x physical network
> interfaces with 2x bond'ed into a public (bond1) and 2x bond'ed into a private
> network (bond0).
>
> Could this have anything to do with it ?

I don't think so. The problem really happens only when corosync is 
configured with an IP address which disappears, so it has to rebind to 
127.0.0.1. You would then see "The network interface is down" in the 
logs. Try to find that message to check whether it is really the problem 
I was referring to.
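
For example, something like this should show it (just a sketch - assuming 
the messages end up in /var/log/syslog on these Ubuntu 14.04 nodes, since 
you only log with "to_syslog: yes"):

  # on the affected node (pg2), look for the rebind-to-loopback warning
  grep -i "network interface is down" /var/log/syslog*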

Regards,
   Honza
>
> Regards,
> Martin Schlegel
>
> ___________________
>
>  From /etc/network/interfaces, i.e.
>
> auto bond0
> iface bond0 inet static
> #pre-up /sbin/ethtool -s bond0 speed 1000 duplex full autoneg on
> post-up ifenslave bond0 eth0 eth2
> pre-down ifenslave -d bond0 eth0 eth2
> bond-slaves none
> bond-mode 4
> bond-lacp-rate fast
> bond-miimon 100
> bond-downdelay 0
> bond-updelay 0
> bond-xmit_hash_policy 1
> address  [...]
>
>> Jan Friesse <jfriesse at redhat.com> wrote on 16 June 2016 at 17:55:
>>
>> Martin Schlegel wrote:
>>
>>> Hello everyone,
>>>
>>> we have been running a 3 node Pacemaker (1.1.14) / Corosync (2.3.5)
>>> cluster successfully for a couple of months, and we have started seeing
>>> a faulty ring with an unexpected 127.0.0.1 binding that we cannot reset
>>> via "corosync-cfgtool -r".
>>
>> This is the problem. Bind to 127.0.0.1 = ifdown happened = problem, and
>> with RRP it means a BIG problem.
>>
>>> We have had this once before, and only restarting Corosync (and
>>> everything else) on the node showing the unexpected 127.0.0.1 binding
>>> made the problem go away. However, in production we obviously would like
>>> to avoid this if possible.
>>
>> Just don't do ifdown. Never. If you are using NetworkManager (which does
>> ifdown by default if the cable is disconnected), use something like the
>> NetworkManager-config-server package (it's just a change of configuration,
>> so you can adapt it to whatever distribution you are using).
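>>
>> For reference, that package essentially just ships a small NetworkManager
>> drop-in, roughly like the sketch below (the exact file name and location
>> vary between distributions):
>>
>>   # e.g. /etc/NetworkManager/conf.d/00-server.conf (path is an example)
>>   [main]
>>   # keep IP configuration on an interface even if its carrier is lost
>>   ignore-carrier=*
>>   # don't create automatic "default" connections for new interfaces
>>   no-auto-default=*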
>>
>> Regards,
>>   Honza
>>
>>> So, from the following description: how can I troubleshoot this issue,
>>> and/or does anybody have a good idea what might be happening here?
>>>
>>> We run 2x passive rrp rings across different IP subnets via udpu and we
>>> get the following output (all IPs obfuscated) - please notice the
>>> unexpected interface binding 127.0.0.1 for host pg2.
>>>
>>> If we reset via "corosync-cfgtool -r" on each node, heartbeat ring id 1
>>> briefly shows "no faults" but goes back to "FAULTY" seconds later.
>>>
>>> Regards,
>>> Martin Schlegel
>>> _____________________________________
>>>
>>> root at pg1:~# corosync-cfgtool -s
>>> Printing ring status.
>>> Local node ID 1
>>> RING ID 0
>>>   id = A.B.C1.5
>>>   status = ring 0 active with no faults
>>> RING ID 1
>>>   id = D.E.F1.170
>>>   status = Marking ringid 1 interface D.E.F1.170 FAULTY
>>>
>>> root at pg2:~# corosync-cfgtool -s
>>> Printing ring status.
>>> Local node ID 2
>>> RING ID 0
>>>   id = A.B.C2.88
>>>   status = ring 0 active with no faults
>>> RING ID 1
>>>   id = 127.0.0.1
>>>   status = Marking ringid 1 interface 127.0.0.1 FAULTY
>>>
>>> root at pg3:~# corosync-cfgtool -s
>>> Printing ring status.
>>> Local node ID 3
>>> RING ID 0
>>>   id = A.B.C3.236
>>>   status = ring 0 active with no faults
>>> RING ID 1
>>>   id = D.E.F3.112
>>>   status = Marking ringid 1 interface D.E.F3.112 FAULTY
>>>
>>> _____________________________________
>>>
>>> /etc/corosync/corosync.conf from pg1 - the other nodes use different
>>> subnets and IPs, but are otherwise identical:
>>> ===========================================
>>> quorum {
>>>   provider: corosync_votequorum
>>>   expected_votes: 3
>>> }
>>>
>>> totem {
>>>   version: 2
>>>
>>>   crypto_cipher: none
>>>   crypto_hash: none
>>>
>>>   rrp_mode: passive
>>>   interface {
>>>     ringnumber: 0
>>>     bindnetaddr: A.B.C1.0
>>>     mcastport: 5405
>>>     ttl: 1
>>>   }
>>>   interface {
>>>     ringnumber: 1
>>>     bindnetaddr: D.E.F1.64
>>>     mcastport: 5405
>>>     ttl: 1
>>>   }
>>>   transport: udpu
>>> }
>>>
>>> nodelist {
>>>   node {
>>>     ring0_addr: pg1
>>>     ring1_addr: pg1p
>>>     nodeid: 1
>>>   }
>>>   node {
>>>     ring0_addr: pg2
>>>     ring1_addr: pg2p
>>>     nodeid: 2
>>>   }
>>>   node {
>>>     ring0_addr: pg3
>>>     ring1_addr: pg3p
>>>     nodeid: 3
>>>   }
>>> }
>>>
>>> logging {
>>>   to_syslog: yes
>>> }
>>>
>>> ===========================================
>>>
>>> _______________________________________________
>>> Users mailing list: Users at clusterlabs.org
>>> http://clusterlabs.org/mailman/listinfo/users
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>
>>>





More information about the Users mailing list