[ClusterLabs] Redundant ring not recovering after node is back

Jan Friesse jfriesse at redhat.com
Thu Aug 23 07:32:52 UTC 2018


David,

> BTW, where I can download Corosync 3.x?
> I've only seen Corosync 2.99.3 Alpha4 at http://corosync.github.io/corosync/

Yes, that's Alpha 4 of Corosync 3.


> 
> 2018-08-23 9:11 GMT+02:00 David Tolosa <david.tolosa at upcnet.es>:
> 
>> I'm currently using an Ubuntu 18.04 server configuration with netplan.
>>
>> Here you have my current YAML configuration:
>>
>> # This file describes the network interfaces available on your system
>> # For more information, see netplan(5).
>> network:
>>    version: 2
>>    renderer: networkd
>>    ethernets:
>>      eno1:
>>        addresses: [192.168.0.1/24]
>>      enp4s0f0:
>>        addresses: [192.168.1.1/24]
>>      enp5s0f0:
>>        {}
>>    vlans:
>>      vlan.XXX:
>>        id: XXX
>>        link: enp5s0f0
>>        addresses: [ 10.1.128.5/29 ]
>>        gateway4: 10.1.128.1
>>        nameservers:
>>          addresses: [ 8.8.8.8, 8.8.4.4 ]
>>          search: [ foo.com, bar.com ]
>>      vlan.YYY:
>>        id: YYY
>>        link: enp5s0f0
>>        addresses: [ 10.1.128.5/29 ]
>>
>>
>> So, eno1 and enp4s0f0 are the two Ethernet ports connected to node2
>> with crossover cables.
>> The enp5s0f0 port is used to connect to the outside/services using VLANs
>> defined in the same file.
>>
>> In short, I'm using systemd-networkd, the default Ubuntu 18.04 server network service, to

Ok, so systemd-networkd really is doing an ifdown, and somebody actually 
tried to fix it and merge it upstream (sadly without too much luck :( )

https://github.com/systemd/systemd/pull/7403
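
As a completely untested sketch (the file name and the drop-in placement are 
my assumption; netplan on 18.04 generates its .network units under 
/run/systemd/network/), you could try telling networkd to keep the addresses 
configured even without carrier:

# /etc/systemd/network/10-netplan-eno1.network.d/keep-config.conf (hypothetical path)
[Network]
# configure the static address even when the link reports no carrier
ConfigureWithoutCarrier=yes

If drop-ins don't work on your systemd version, copying the generated file 
from /run/systemd/network/ to /etc/systemd/network/ and editing it there 
should have the same effect (files in /etc take precedence). I can't promise 
this avoids the ifdown corosync sees, so treat it as an experiment, not a fix.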


>> manage networks. I'm not detecting any NetworkManager-config-server
>> package in my repository either.

I'm not sure what it's called in Debian-based distributions, but it's 
just one small file in /etc, so you can extract it from the RPM.
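
If extracting the RPM is a pain, the package essentially just drops a tiny 
ini file into NetworkManager's conf.d; something like the following placed in 
/etc/NetworkManager/conf.d/ should be equivalent (contents reproduced from 
memory of the RHEL package, so please double-check):

# /etc/NetworkManager/conf.d/00-server.conf
[main]
# don't auto-create default connections for otherwise unconfigured NICs
no-auto-default=*
# don't tear an interface down when the cable/peer drops carrier
ignore-carrier=*

The ignore-carrier part is the one that matters for your crossover-cable setup.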

>> So the only solution I have left, I suppose, is to test corosync 3.x
>> and see if it handles RRP better.

You may also consider trying either a completely static network 
configuration or NetworkManager + NetworkManager-config-server.


Corosync 3.x with knet will work for sure, but be prepared for quite a 
long compile path, because you first have to compile knet and then 
corosync. What may help you a bit is that we have Ubuntu 18.04 in our 
Jenkins, so building there is known to work (corosync build log: 
https://ci.kronosnet.org/view/corosync/job/corosync-build-all-voting/lastBuild/corosync-build-all-voting=ubuntu-18-04-lts-x86-64/consoleText, 
knet build log: 
https://ci.kronosnet.org/view/knet/job/knet-build-all-voting/lastBuild/knet-build-all-voting=ubuntu-18-04-lts-x86-64/consoleText).
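
In case it saves you some time, the build boils down to roughly this 
(assuming the usual autotools toolchain and the libqb/libnl/crypto dev 
packages are already installed; configure options are up to you):

git clone https://github.com/kronosnet/kronosnet.git
cd kronosnet && ./autogen.sh && ./configure && make && sudo make install

git clone https://github.com/corosync/corosync.git
cd corosync && ./autogen.sh && ./configure && make && sudo make install

The Jenkins console logs above show the exact packages and options we use.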

Also please consult 
http://people.redhat.com/ccaulfie/docs/KnetCorosync.pdf about changes in 
corosync configuration.
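
Just to give you a feel for it (a minimal sketch only, written from memory 
rather than tested against your setup, so verify it against the PDF): the 
per-ring interface{} sections and rrp_mode go away, and the redundant links 
are described per node instead, roughly:

totem {
        version: 2
        cluster_name: node
        transport: knet
        link_mode: passive
}

nodelist {
        node {
                nodeid: 1
                name: node1
                ring0_addr: 192.168.0.1
                ring1_addr: 192.168.1.1
        }
        node {
                nodeid: 2
                name: node2
                ring0_addr: 192.168.0.2
                ring1_addr: 192.168.1.2
        }
}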

Regards,
   Honza

>>
>> Thank you for your quick response!
>>
>> 2018-08-23 8:40 GMT+02:00 Jan Friesse <jfriesse at redhat.com>:
>>
>>> David,
>>>
>>>> Hello,
>>>> I'm going crazy over this problem, which I hope to resolve here with
>>>> your help, guys:
>>>>
>>>> I have 2 nodes with Corosync redundant ring feature.
>>>>
>>>> Each node has 2 similarly connected/configured NICs. Both nodes are
>>>> connected to each other by two crossover cables.
>>>>
>>>
>>> I believe this is the root of the problem. Are you using NetworkManager? If
>>> so, have you installed NetworkManager-config-server? If not, please install
>>> it and test again.
>>>
>>>
>>>> I configured both nodes with rrp_mode passive. Everything is working well
>>>> at this point, but when I shut down one node to test failover, and this
>>>> node comes back online, corosync marks the interface as FAULTY
>>>> and rrp
>>>>
>>>
>>> I believe it's because, with the crossover-cable configuration, when the
>>> other side is shut down, NetworkManager detects it and does an ifdown of
>>> the interface. And corosync is unable to handle ifdown properly. Ifdown is
>>> bad with a single ring, but it's just a killer with RRP (127.0.0.1 poisons
>>> every node in the cluster).
>>>
>>>> fails to recover the initial state:
>>>>
>>>> 1. Initial scenario:
>>>>
>>>> # corosync-cfgtool -s
>>>> Printing ring status.
>>>> Local node ID 1
>>>> RING ID 0
>>>>           id      = 192.168.0.1
>>>>           status  = ring 0 active with no faults
>>>> RING ID 1
>>>>           id      = 192.168.1.1
>>>>           status  = ring 1 active with no faults
>>>>
>>>>
>>>> 2. When I shut down node 2, everything continues with no faults. Sometimes
>>>> the ring IDs bind to 127.0.0.1 and then bind back to their respective
>>>> heartbeat IPs.
>>>>
>>>
>>> Again, result of ifdown.
>>>
>>>
>>>> 3. When node 2 is back online:
>>>>
>>>> # corosync-cfgtool -s
>>>> Printing ring status.
>>>> Local node ID 1
>>>> RING ID 0
>>>>           id      = 192.168.0.1
>>>>           status  = ring 0 active with no faults
>>>> RING ID 1
>>>>           id      = 192.168.1.1
>>>>           status  = Marking ringid 1 interface 192.168.1.1 FAULTY
>>>>
>>>>
>>>> # service corosync status
>>>> ● corosync.service - Corosync Cluster Engine
>>>>      Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
>>>>      Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s ago
>>>>        Docs: man:corosync
>>>>              man:corosync.conf
>>>>              man:corosync_overview
>>>>    Main PID: 1439 (corosync)
>>>>       Tasks: 2 (limit: 4915)
>>>>      CGroup: /system.slice/corosync.service
>>>>              └─1439 /usr/sbin/corosync -f
>>>>
>>>>
>>>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The network interface [192.168.0.1] is now up.
>>>> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface [192.168.0.1] is now up.
>>>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The network interface [192.168.1.1] is now up.
>>>> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface [192.168.1.1] is now up.
>>>> Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ] A new membership (192.168.0.1:601760) was formed. Members
>>>> Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (192.168.0.1:601760) was formed. Members
>>>> Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ] A new membership (192.168.0.1:601764) was formed. Members joined: 2
>>>> Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (192.168.0.1:601764) was formed. Members joined: 2
>>>> Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error   [TOTEM ] Marking ringid 1 interface 192.168.1.1 FAULTY
>>>> Aug 22 14:44:34 node1 corosync[1439]:   [TOTEM ] Marking ringid 1 interface 192.168.1.1 FAULTY
>>>>
>>>>
>>>> If I execute corosync-cfgtool, it clears the faulty error, but after some
>>>> seconds it goes back to FAULTY.
>>>> The only thing that resolves the problem is to restart the service with
>>>> service corosync restart.
>>>>
>>>> Here you have some of my configuration settings on node 1 (I already
>>>> tried changing rrp_mode):
>>>>
>>>> *- corosync.conf*
>>>>
>>>>
>>>> totem {
>>>>           version: 2
>>>>           cluster_name: node
>>>>           token: 5000
>>>>           token_retransmits_before_loss_const: 10
>>>>           secauth: off
>>>>           threads: 0
>>>>           rrp_mode: passive
>>>>           nodeid: 1
>>>>           interface {
>>>>                   ringnumber: 0
>>>>                   bindnetaddr: 192.168.0.0
>>>>                   #mcastaddr: 226.94.1.1
>>>>                   mcastport: 5405
>>>>                   broadcast: yes
>>>>           }
>>>>           interface {
>>>>                   ringnumber: 1
>>>>                   bindnetaddr: 192.168.1.0
>>>>                   #mcastaddr: 226.94.1.2
>>>>                   mcastport: 5407
>>>>                   broadcast: yes
>>>>           }
>>>> }
>>>>
>>>> logging {
>>>>           fileline: off
>>>>           to_stderr: yes
>>>>           to_syslog: yes
>>>>           to_logfile: yes
>>>>           logfile: /var/log/corosync/corosync.log
>>>>           debug: off
>>>>           timestamp: on
>>>>           logger_subsys {
>>>>                   subsys: AMF
>>>>                   debug: off
>>>>           }
>>>> }
>>>>
>>>> amf {
>>>>           mode: disabled
>>>> }
>>>>
>>>> quorum {
>>>>           provider: corosync_votequorum
>>>>           expected_votes: 2
>>>> }
>>>>
>>>> nodelist {
>>>>           node {
>>>>                   nodeid: 1
>>>>                   ring0_addr: 192.168.0.1
>>>>                   ring1_addr: 192.168.1.1
>>>>           }
>>>>
>>>>           node {
>>>>                   nodeid: 2
>>>>                   ring0_addr: 192.168.0.2
>>>>                   ring1_addr: 192.168.1.2
>>>>           }
>>>> }
>>>>
>>>> aisexec {
>>>>           user: root
>>>>           group: root
>>>> }
>>>>
>>>> service {
>>>>           name: pacemaker
>>>>           ver: 1
>>>> }
>>>>
>>>>
>>>>
>>>> *- /etc/hosts*
>>>>
>>>>
>>>> 127.0.0.1       localhost
>>>> 10.4.172.5      node1.upc.edu node1
>>>> 10.4.172.6      node2.upc.edu node2
>>>>
>>>>
>>> So the machines have 3 NICs? 2 for corosync/cluster traffic and one for
>>> regular traffic/services/outside world?
>>>
>>>
>>>> Thank you for you help in advance!
>>>>
>>>
>>> To conclude:
>>> - If you are using NetworkManager, try installing
>>> NetworkManager-config-server; it will probably help.
>>> - If you are brave enough, try corosync 3.x (the current Alpha4 is pretty
>>> stable - actually, some other projects only gain this stability with SP1 :) ),
>>> which has no RRP but uses knet to support redundant links (up to 8 links
>>> can be configured) and doesn't have problems with ifdown.
>>>
>>> Honza
>>>
>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> *David Tolosa Martínez*
>> Customer Support & Infrastructure
>> UPCnet - Edifici Vèrtex
>> Plaça d'Eusebi Güell, 6, 08034 Barcelona
>> Tel: 934054555
>>
>> <https://www.upcnet.es>
>>
> 
> 
> 


