[ClusterLabs] Redundant ring not recovering after node is back

Jan Friesse jfriesse at redhat.com
Fri Aug 24 06:21:26 UTC 2018


> I tried to install corosync 3.x and it works pretty well.

Cool

> But when I install pacemaker, it installs the previous version of corosync
> as a dependency and breaks the whole setup.
> Any suggestions?

I can see at least the following "solutions":
- make a proper Debian package
- install corosync 3 into /usr/local (see the sketch below)
- (ugly) install the packaged corosync and then overwrite it with corosync 3
  built from source
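
For the /usr/local option, the usual autotools flow should be roughly the
following (an untested sketch, assuming a corosync 3 source checkout with
knet already installed; adjust paths as needed):

  # build and install corosync 3 outside the package manager's view
  ./autogen.sh
  ./configure --prefix=/usr/local
  make
  sudo make install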

Regards,
   Honza

> 
> 2018-08-23 9:32 GMT+02:00 Jan Friesse <jfriesse at redhat.com>:
> 
>> David,
>>
>>> BTW, where can I download Corosync 3.x?
>>> I've only seen Corosync 2.99.3 Alpha4 at http://corosync.github.io/corosync/
>>>
>>
>> Yes, that's Alpha 4 of Corosync 3.
>>
>>
>>
>>
>>> 2018-08-23 9:11 GMT+02:00 David Tolosa <david.tolosa at upcnet.es>:
>>>
>>> I'm currently using an Ubuntu 18.04 server configuration with netplan.
>>>>
>>>> Here you have my current YAML configuration:
>>>>
>>>> # This file describes the network interfaces available on your system
>>>> # For more information, see netplan(5).
>>>> network:
>>>>     version: 2
>>>>     renderer: networkd
>>>>     ethernets:
>>>>       eno1:
>>>>         addresses: [192.168.0.1/24]
>>>>       enp4s0f0:
>>>>         addresses: [192.168.1.1/24]
>>>>       enp5s0f0:
>>>>         {}
>>>>     vlans:
>>>>       vlan.XXX:
>>>>         id: XXX
>>>>         link: enp5s0f0
>>>>         addresses: [ 10.1.128.5/29 ]
>>>>         gateway4: 10.1.128.1
>>>>         nameservers:
>>>>           addresses: [ 8.8.8.8, 8.8.4.4 ]
>>>>           search: [ foo.com, bar.com ]
>>>>       vlan.YYY:
>>>>         id: YYY
>>>>         link: enp5s0f0
>>>>         addresses: [ 10.1.128.5/29 ]
>>>>
>>>>
>>>> So, eno1 and enp4s0f0 are the two Ethernet ports connected directly to
>>>> node2 with crossover cables.
>>>> The enp5s0f0 port is used to connect to outside services using the VLANs
>>>> defined in the same file.
>>>>
>>>> In short, I'm using systemd-networkd, the default Ubuntu 18.04 server service, to
>>>>
>>>
>> OK, so systemd-networkd is really doing an ifdown, and somebody actually
>> tried to fix it and merge it upstream (sadly, with not much luck :( )
>>
>> https://github.com/systemd/systemd/pull/7403
>>
>>
>>>> manage networks. I'm not detecting any NetworkManager-config-server
>>>> package in my repository either.
>>>>
>>>
>> I'm not sure what it's called in Debian-based distributions, but it's just
>> one small file in /etc, so you can extract it from the RPM.
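>>
>> If memory serves, the file boils down to something like this (a sketch
>> from memory, not copied from the actual package, so double-check the
>> path and contents):
>>
>>   # /etc/NetworkManager/conf.d/00-server.conf
>>   [main]
>>   no-auto-default=*
>>   ignore-carrier=*
>>
>> The ignore-carrier=* setting is what keeps NetworkManager from tearing
>> the interface configuration down when the carrier disappears.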
>>
>>>> So the only solution I have left, I suppose, is to test corosync 3.x
>>>> and see if it handles RRP better.
>>>>
>>>
>> You may also consider trying either a completely static network
>> configuration or NetworkManager + NetworkManager-config-server.
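>>
>> For the "completely static" route on Ubuntu, one option is plain ifupdown
>> instead of netplan/systemd-networkd, e.g. something along these lines in
>> /etc/network/interfaces (an illustrative sketch for the two ring
>> interfaces only, not a complete or verified config):
>>
>>   # ring 0 heartbeat interface
>>   auto eno1
>>   iface eno1 inet static
>>       address 192.168.0.1
>>       netmask 255.255.255.0
>>
>>   # ring 1 heartbeat interface
>>   auto enp4s0f0
>>   iface enp4s0f0 inet static
>>       address 192.168.1.1
>>       netmask 255.255.255.0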
>>
>>
>> Corosync 3.x with knet will work for sure, but be prepared for quite a
>> long compile path, because you first have to compile knet and then
>> corosync. What may help you a bit is that we have an Ubuntu 18.04 machine
>> in our Jenkins, so it should be possible (corosync build log:
>> https://ci.kronosnet.org/view/corosync/job/corosync-build-all-voting/lastBuild/corosync-build-all-voting=ubuntu-18-04-lts-x86-64/consoleText,
>> knet build log:
>> https://ci.kronosnet.org/view/knet/job/knet-build-all-voting/lastBuild/knet-build-all-voting=ubuntu-18-04-lts-x86-64/consoleText).
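>>
>> Roughly, the build order looks like this (an untested sketch; the
>> autogen.sh/configure steps and repository URLs are the usual upstream
>> defaults, not taken from the build logs above):
>>
>>   # knet first, because corosync 3.x links against it
>>   git clone https://github.com/kronosnet/kronosnet.git
>>   cd kronosnet && ./autogen.sh && ./configure && make && sudo make install
>>   cd ..
>>   # then corosync 3.x itself
>>   git clone https://github.com/corosync/corosync.git
>>   cd corosync && ./autogen.sh && ./configure && make && sudo make install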
>>
>> Also please consult http://people.redhat.com/ccaulfie/docs/KnetCorosync.pdf
>> about changes in corosync configuration.
>>
>> Regards,
>>    Honza
>>
>>
>>>> Thank you for your quick response!
>>>>
>>>> 2018-08-23 8:40 GMT+02:00 Jan Friesse <jfriesse at redhat.com>:
>>>>
>>>>> David,
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I'm going crazy over this problem, which I hope to resolve here with
>>>>>> your help, guys:
>>>>>>
>>>>>> I have 2 nodes with the Corosync redundant ring feature.
>>>>>>
>>>>>> Each node has 2 similarly connected/configured NICs. The nodes are
>>>>>> connected to each other by two crossover cables.
>>>>>>
>>>>>>
>>>>> I believe this is the root of the problem. Are you using NetworkManager?
>>>>> If so, have you installed NetworkManager-config-server? If not, please
>>>>> install it and test again.
>>>>>
>>>>>
>>>>>> I configured both nodes with rrp_mode passive. Everything is working well
>>>>>> at this point, but when I shut down one node to test failover and that
>>>>>> node comes back online, corosync marks the interface as FAULTY and RRP
>>>>>>
>>>>>>
>>>>> I believe it's because, with the crossover-cable configuration, when the
>>>>> other side is shut down NetworkManager detects it and does an ifdown of
>>>>> the interface. And corosync is unable to handle ifdown properly. Ifdown
>>>>> is bad with a single ring, but it's a killer with RRP (127.0.0.1 poisons
>>>>> every node in the cluster).
>>>>>
>>>>>
>>>>>> fails to recover the initial state:
>>>>>>
>>>>>> 1. Initial scenario:
>>>>>>
>>>>>> # corosync-cfgtool -s
>>>>>> Printing ring status.
>>>>>> Local node ID 1
>>>>>> RING ID 0
>>>>>>            id      = 192.168.0.1
>>>>>>            status  = ring 0 active with no faults
>>>>>> RING ID 1
>>>>>>            id      = 192.168.1.1
>>>>>>            status  = ring 1 active with no faults
>>>>>>
>>>>>>
>>>>>> 2. When I shut down node 2, everything continues with no faults. Sometimes
>>>>>> the ring IDs bind to 127.0.0.1 and then bind back to their respective
>>>>>> heartbeat IPs.
>>>>>>
>>>>>>
>>>>> Again, result of ifdown.
>>>>>
>>>>>
>>>>>> 3. When node 2 is back online:
>>>>>>
>>>>>> # corosync-cfgtool -s
>>>>>> Printing ring status.
>>>>>> Local node ID 1
>>>>>> RING ID 0
>>>>>>            id      = 192.168.0.1
>>>>>>            status  = ring 0 active with no faults
>>>>>> RING ID 1
>>>>>>            id      = 192.168.1.1
>>>>>>            status  = Marking ringid 1 interface 192.168.1.1 FAULTY
>>>>>>
>>>>>>
>>>>>> # service corosync status
>>>>>> ● corosync.service - Corosync Cluster Engine
>>>>>>       Loaded: loaded (/lib/systemd/system/corosync.service; enabled;
>>>>>> vendor
>>>>>> preset: enabled)
>>>>>>       Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min
>>>>>> 38s ago
>>>>>>         Docs: man:corosync
>>>>>>               man:corosync.conf
>>>>>>               man:corosync_overview
>>>>>>     Main PID: 1439 (corosync)
>>>>>>        Tasks: 2 (limit: 4915)
>>>>>>       CGroup: /system.slice/corosync.service
>>>>>>               └─1439 /usr/sbin/corosync -f
>>>>>>
>>>>>>
>>>>>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ]
>>>>>> The
>>>>>> network interface [192.168.0.1] is now up.
>>>>>> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
>>>>>> [192.168.0.1] is now up.
>>>>>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ]
>>>>>> The
>>>>>> network interface [192.168.1.1] is now up.
>>>>>> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
>>>>>> [192.168.1.1] is now up.
>>>>>> Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ]
>>>>>> A
>>>>>> new membership (192.168.0.1:601760) was formed. Members
>>>>>> Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (
>>>>>> 192.168.0.1:601760) was formed. Members
>>>>>> Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ]
>>>>>> A
>>>>>> new membership (192.168.0.1:601764) was formed. Members joined: 2
>>>>>> Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (
>>>>>> 192.168.0.1:601764) was formed. Members joined: 2
>>>>>> Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error   [TOTEM ]
>>>>>> Marking ringid 1 interface 192.168.1.1 FAULTY
>>>>>> Aug 22 14:44:34 node1 corosync[1439]:   [TOTEM ] Marking ringid 1
>>>>>> interface
>>>>>> 192.168.1.1 FAULTY
>>>>>>
>>>>>>
>>>>>> If I execute corosync-cfgtool, it clears the faulty error, but after a few
>>>>>> seconds the ring returns to FAULTY.
>>>>>> The only thing that resolves the problem is to restart the service with
>>>>>> service corosync restart.
>>>>>>
>>>>>> Here are some of my configuration settings on node 1 (I have already
>>>>>> tried changing rrp_mode):
>>>>>>
>>>>>> *- corosync.conf*
>>>>>>
>>>>>>
>>>>>> totem {
>>>>>>            version: 2
>>>>>>            cluster_name: node
>>>>>>            token: 5000
>>>>>>            token_retransmits_before_loss_const: 10
>>>>>>            secauth: off
>>>>>>            threads: 0
>>>>>>            rrp_mode: passive
>>>>>>            nodeid: 1
>>>>>>            interface {
>>>>>>                    ringnumber: 0
>>>>>>                    bindnetaddr: 192.168.0.0
>>>>>>                    #mcastaddr: 226.94.1.1
>>>>>>                    mcastport: 5405
>>>>>>                    broadcast: yes
>>>>>>            }
>>>>>>            interface {
>>>>>>                    ringnumber: 1
>>>>>>                    bindnetaddr: 192.168.1.0
>>>>>>                    #mcastaddr: 226.94.1.2
>>>>>>                    mcastport: 5407
>>>>>>                    broadcast: yes
>>>>>>            }
>>>>>> }
>>>>>>
>>>>>> logging {
>>>>>>            fileline: off
>>>>>>            to_stderr: yes
>>>>>>            to_syslog: yes
>>>>>>            to_logfile: yes
>>>>>>            logfile: /var/log/corosync/corosync.log
>>>>>>            debug: off
>>>>>>            timestamp: on
>>>>>>            logger_subsys {
>>>>>>                    subsys: AMF
>>>>>>                    debug: off
>>>>>>            }
>>>>>> }
>>>>>>
>>>>>> amf {
>>>>>>            mode: disabled
>>>>>> }
>>>>>>
>>>>>> quorum {
>>>>>>            provider: corosync_votequorum
>>>>>>            expected_votes: 2
>>>>>> }
>>>>>>
>>>>>> nodelist {
>>>>>>            node {
>>>>>>                    nodeid: 1
>>>>>>                    ring0_addr: 192.168.0.1
>>>>>>                    ring1_addr: 192.168.1.1
>>>>>>            }
>>>>>>
>>>>>>            node {
>>>>>>                    nodeid: 2
>>>>>>                    ring0_addr: 192.168.0.2
>>>>>>                    ring1_addr: 192.168.1.2
>>>>>>            }
>>>>>> }
>>>>>>
>>>>>> aisexec {
>>>>>>            user: root
>>>>>>            group: root
>>>>>> }
>>>>>>
>>>>>> service {
>>>>>>            name: pacemaker
>>>>>>            ver: 1
>>>>>> }
>>>>>>
>>>>>>
>>>>>>
>>>>>> *- /etc/hosts*
>>>>>>
>>>>>>
>>>>>> 127.0.0.1       localhost
>>>>>> 10.4.172.5      node1.upc.edu node1
>>>>>> 10.4.172.6      node2.upc.edu node2
>>>>>>
>>>>>>
>>>>> So the machines have 3 NICs? Two for corosync/cluster traffic and one
>>>>> for regular traffic/services/the outside world?
>>>>>
>>>>>
>>>>>> Thank you for your help in advance!
>>>>>>
>>>>>>
>>>>> To conclude:
>>>>> - If you are using NetworkManager, try installing
>>>>> NetworkManager-config-server; it will probably help.
>>>>> - If you are brave enough, try corosync 3.x (the current Alpha4 is pretty
>>>>> stable - actually, some other projects only reach this stability with SP1
>>>>> :) ). It has no RRP but uses knet to support redundant links (up to 8
>>>>> links can be configured) and doesn't have problems with ifdown; see the
>>>>> rough config sketch below.
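>>>>>
>>>>> As a very rough illustration (pieced together from memory and the
>>>>> KnetCorosync.pdf mentioned above, so treat it as a sketch rather than a
>>>>> verified configuration), a two-link knet setup for your addresses would
>>>>> look roughly like:
>>>>>
>>>>>   totem {
>>>>>           version: 2
>>>>>           cluster_name: node
>>>>>           transport: knet
>>>>>           link_mode: passive
>>>>>   }
>>>>>
>>>>>   nodelist {
>>>>>           node {
>>>>>                   nodeid: 1
>>>>>                   ring0_addr: 192.168.0.1
>>>>>                   ring1_addr: 192.168.1.1
>>>>>           }
>>>>>           node {
>>>>>                   nodeid: 2
>>>>>                   ring0_addr: 192.168.0.2
>>>>>                   ring1_addr: 192.168.1.2
>>>>>           }
>>>>>   }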
>>>>>
>>>>> Honza
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Users mailing list: Users at clusterlabs.org
>>>>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>>>>
>>>>>> Project Home: http://www.clusterlabs.org
>>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>> Bugs: http://bugs.clusterlabs.org
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> *David Tolosa Martínez*
>>>> Customer Support & Infrastructure
>>>> UPCnet - Edifici Vèrtex
>>>> Plaça d'Eusebi Güell, 6, 08034 Barcelona
>>>> Tel: 934054555
>>>>
>>>> <https://www.upcnet.es>
>>>>
>>>>
>>>
>>>
>>>
>>
> 
> 


