[ClusterLabs] Redundant ring not recovering after node is back

David Tolosa david.tolosa at upcnet.es
Fri Aug 24 03:25:24 EDT 2018


How can I follow the first two solutions?
Regards,

2018-08-24 8:21 GMT+02:00 Jan Friesse <jfriesse at redhat.com>:

> I tried to install corosync 3.x and it works pretty well.
>>
>
> Cool
>
>> But when I install pacemaker, it installs the previous version of corosync
>> as a dependency and breaks the whole setup.
>> Any suggestions?
>>
>
> I can see at least the following "solutions":
> - make a proper Debian package
> - install corosync 3 to /usr/local (a rough sketch below)
> - (ugly) install the packaged corosync, then overwrite it with corosync 3 built from source
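>
> A minimal sketch of the /usr/local route, if it helps; the package names
> and configure flags are from memory for Ubuntu 18.04, so treat them as
> assumptions to verify against the build logs mentioned earlier in this
> thread:
>
>     # build dependencies (an assumed, not exhaustive, list)
>     sudo apt-get install build-essential git autoconf automake libtool \
>         pkg-config libqb-dev libnl-3-dev libnl-genl-3-dev zlib1g-dev
>
>     # 1) build and install kronosnet (knet) first
>     git clone https://github.com/kronosnet/kronosnet.git
>     cd kronosnet
>     ./autogen.sh && ./configure --prefix=/usr/local
>     make && sudo make install
>     cd ..
>
>     # 2) then corosync, pointed at the knet we just installed
>     git clone https://github.com/corosync/corosync.git
>     cd corosync
>     ./autogen.sh
>     PKG_CONFIG_PATH=/usr/local/lib/pkgconfig ./configure --prefix=/usr/local
>     make && sudo make install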
>
> Regards,
>   Honza
>
>
>
>> 2018-08-23 9:32 GMT+02:00 Jan Friesse <jfriesse at redhat.com>:
>>
>> David,
>>>
>>> BTW, where can I download Corosync 3.x?
>>>
>>>> I've only seen Corosync 2.99.3 Alpha4 at http://corosync.github.io/corosync/
>>>>
>>>>
>>> Yes, that's Alpha 4 of Corosync 3.
>>>
>>>
>>>
>>>
>>> 2018-08-23 9:11 GMT+02:00 David Tolosa <david.tolosa at upcnet.es>:
>>>>
>>>> I'm currently using an Ubuntu 18.04 server configuration with netplan.
>>>>
>>>>>
>>>>> Here is my current YAML configuration:
>>>>>
>>>>> # This file describes the network interfaces available on your system
>>>>> # For more information, see netplan(5).
>>>>> network:
>>>>>     version: 2
>>>>>     renderer: networkd
>>>>>     ethernets:
>>>>>       eno1:
>>>>>         addresses: [192.168.0.1/24]
>>>>>       enp4s0f0:
>>>>>         addresses: [192.168.1.1/24]
>>>>>       enp5s0f0: {}
>>>>>     vlans:
>>>>>       vlan.XXX:
>>>>>         id: XXX
>>>>>         link: enp5s0f0
>>>>>         addresses: [ 10.1.128.5/29 ]
>>>>>         gateway4: 10.1.128.1
>>>>>         nameservers:
>>>>>           addresses: [ 8.8.8.8, 8.8.4.4 ]
>>>>>           search: [ foo.com, bar.com ]
>>>>>       vlan.YYY:
>>>>>         id: YYY
>>>>>         link: enp5s0f0
>>>>>         addresses: [ 10.1.128.5/29 ]
>>>>>
>>>>>
>>>>> So, eno1 and enp4s0f0 are the two Ethernet ports connected directly to
>>>>> node2 with crossover cables.
>>>>> The enp5s0f0 port is used to connect to the outside world/services,
>>>>> using the VLANs defined in the same file.
>>>>>
>>>>> In short, I'm using systemd-networkd, the default Ubuntu 18 server
>>>>> service, to
>>>>>
>>>>>
>>> OK, so systemd-networkd is really doing an ifdown, and somebody actually
>>> tried to fix it and merge it upstream (sadly, without much luck :( ):
>>>
>>> https://github.com/systemd/systemd/pull/7403
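>>>
>>> (An untested thought on the networkd side: systemd.network has a
>>> ConfigureWithoutCarrier= option that might keep the ring addresses
>>> configured while the peer is down. Whether the systemd version shipped
>>> with 18.04 honours it across a carrier loss I can't say, and netplan may
>>> not expose it, so this is only a sketch of a hand-written drop-in:
>>>
>>>     # /etc/systemd/network/10-ring0.network  (hypothetical)
>>>     [Match]
>>>     Name=eno1
>>>
>>>     [Network]
>>>     Address=192.168.0.1/24
>>>     ConfigureWithoutCarrier=yes
>>> )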
>>>
>>>
>>> manage networks. I'm not detecting any NetworkManager-config-server
>>>> package in my repository either.
>>>>>
>>>>>
>>>>>
>>> I'm not sure what it's called in Debian-based distributions, but it's just
>>> one small file in /etc, so you can extract it from the RPM.
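>>>
>>> (For reference, on Fedora/RHEL the package is essentially one small
>>> drop-in telling NetworkManager to behave like a server: don't generate
>>> automatic connections and don't react to carrier loss. Recreating it by
>>> hand should look roughly like this; the path is my assumption for a
>>> Debian-based system:
>>>
>>>     # /etc/NetworkManager/conf.d/00-server.conf
>>>     [main]
>>>     no-auto-default=*
>>>     ignore-carrier=*
>>> )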
>>>
>>> So the only solution I have left, I suppose, is to test corosync 3.x
>>>> and see whether it handles RRP better.
>>>>>
>>>>>
>>> You may also reconsider trying either a completely static network
>>> configuration, or NetworkManager + NetworkManager-config-server.
>>>
>>>
>>> Corosync 3.x with knet will work for sure, but be prepared for quite a
>>> long compile path, because you first have to compile knet and then
>>> corosync. What may help you a bit is that we have Ubuntu 18.04 in our
>>> Jenkins, so you can follow our build logs (corosync build log:
>>> https://ci.kronosnet.org/view/corosync/job/corosync-build-all-voting/lastBuild/corosync-build-all-voting=ubuntu-18-04-lts-x86-64/consoleText,
>>> knet build log:
>>> https://ci.kronosnet.org/view/knet/job/knet-build-all-voting/lastBuild/knet-build-all-voting=ubuntu-18-04-lts-x86-64/consoleText).
>>>
>>> Also, please consult http://people.redhat.com/ccaulfie/docs/KnetCorosync.pdf
>>> about the changes in corosync configuration.
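>>>
>>> (For a rough idea of the new shape: with knet the per-ring interface{}
>>> blocks and rrp_mode go away, and the redundant links are derived from
>>> the ringX_addr entries in the nodelist. A minimal sketch only; the PDF
>>> above is the authoritative reference:
>>>
>>>     totem {
>>>         version: 2
>>>         transport: knet
>>>     }
>>>
>>>     nodelist {
>>>         node {
>>>             nodeid: 1
>>>             ring0_addr: 192.168.0.1
>>>             ring1_addr: 192.168.1.1
>>>         }
>>>         node {
>>>             nodeid: 2
>>>             ring0_addr: 192.168.0.2
>>>             ring1_addr: 192.168.1.2
>>>         }
>>>     }
>>> )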
>>>
>>> Regards,
>>>    Honza
>>>
>>>
>>> Thank you for your quick response!
>>>>>
>>>>> 2018-08-23 8:40 GMT+02:00 Jan Friesse <jfriesse at redhat.com>:
>>>>>
>>>>> David,
>>>>>
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>>> I'm going crazy over this problem, which I hope to resolve here with
>>>>>>> your help, guys:
>>>>>>>
>>>>>>> I have 2 nodes with the Corosync redundant ring feature.
>>>>>>>
>>>>>>> Each node has 2 similarly connected/configured NICs. The nodes are
>>>>>>> connected to each other by two crossover cables.
>>>>>>>
>>>>>>>
>>>>>> I believe this is the root of the problem. Are you using NetworkManager?
>>>>>> If so, have you installed NetworkManager-config-server? If not, please
>>>>>> install it and test again.
>>>>>>
>>>>>>
>>>>>>> I configured both nodes with rrp_mode passive. Everything works well
>>>>>>> at this point, but when I shut down one node to test failover and that
>>>>>>> node comes back online, corosync marks the interface as FAULTY and RRP
>>>>>>>
>>>>>>>
>>>>>> I believe it's because, with a crossover-cable configuration, when the
>>>>>> other side is shut down NetworkManager detects it and does an ifdown of
>>>>>> the interface, and corosync is unable to handle ifdown properly. Ifdown
>>>>>> is bad with a single ring, but it's a killer with RRP (127.0.0.1 poisons
>>>>>> every node in the cluster).
>>>>>>
>>>>>> fails to recover the initial state:
>>>>>>
>>>>>>
>>>>>>> 1. Initial scenario:
>>>>>>>
>>>>>>> # corosync-cfgtool -s
>>>>>>> Printing ring status.
>>>>>>> Local node ID 1
>>>>>>> RING ID 0
>>>>>>>            id      = 192.168.0.1
>>>>>>>            status  = ring 0 active with no faults
>>>>>>> RING ID 1
>>>>>>>            id      = 192.168.1.1
>>>>>>>            status  = ring 1 active with no faults
>>>>>>>
>>>>>>>
>>>>>>> 2. When I shut down node 2, everything continues with no faults.
>>>>>>> Sometimes the ring IDs bind to 127.0.0.1 and then bind back to their
>>>>>>> respective heartbeat IPs.
>>>>>>>
>>>>>>>
>>>>>> Again, a result of the ifdown.
>>>>>>
>>>>>>
>>>>>> 3. When node 2 is back online:
>>>>>>
>>>>>>>
>>>>>>> # corosync-cfgtool -s
>>>>>>> Printing ring status.
>>>>>>> Local node ID 1
>>>>>>> RING ID 0
>>>>>>>            id      = 192.168.0.1
>>>>>>>            status  = ring 0 active with no faults
>>>>>>> RING ID 1
>>>>>>>            id      = 192.168.1.1
>>>>>>>            status  = Marking ringid 1 interface 192.168.1.1 FAULTY
>>>>>>>
>>>>>>>
>>>>>>> # service corosync status
>>>>>>> ● corosync.service - Corosync Cluster Engine
>>>>>>>       Loaded: loaded (/lib/systemd/system/corosync.service; enabled;
>>>>>>> vendor
>>>>>>> preset: enabled)
>>>>>>>       Active: active (running) since Wed 2018-08-22 14:44:09 CEST;
>>>>>>> 1min
>>>>>>> 38s ago
>>>>>>>         Docs: man:corosync
>>>>>>>               man:corosync.conf
>>>>>>>               man:corosync_overview
>>>>>>>     Main PID: 1439 (corosync)
>>>>>>>        Tasks: 2 (limit: 4915)
>>>>>>>       CGroup: /system.slice/corosync.service
>>>>>>>               └─1439 /usr/sbin/corosync -f
>>>>>>>
>>>>>>>
>>>>>>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The network interface [192.168.0.1] is now up.
>>>>>>> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface [192.168.0.1] is now up.
>>>>>>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The network interface [192.168.1.1] is now up.
>>>>>>> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface [192.168.1.1] is now up.
>>>>>>> Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ] A new membership (192.168.0.1:601760) was formed. Members
>>>>>>> Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (192.168.0.1:601760) was formed. Members
>>>>>>> Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ] A new membership (192.168.0.1:601764) was formed. Members joined: 2
>>>>>>> Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (192.168.0.1:601764) was formed. Members joined: 2
>>>>>>> Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error   [TOTEM ] Marking ringid 1 interface 192.168.1.1 FAULTY
>>>>>>> Aug 22 14:44:34 node1 corosync[1439]:   [TOTEM ] Marking ringid 1 interface 192.168.1.1 FAULTY
>>>>>>>
>>>>>>>
>>>>>>> If I execute corosync-cfgtool, it clears the faulty state, but after a
>>>>>>> few seconds the ring is marked FAULTY again.
>>>>>>> The only thing that resolves the problem is restarting the service with
>>>>>>> service corosync restart.
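>>>>>>>
>>>>>>> (I assume the command used here is the re-enable option, which resets
>>>>>>> the state of rings that corosync has marked FAULTY:
>>>>>>>
>>>>>>>     corosync-cfgtool -r
>>>>>>> )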
>>>>>>>
>>>>>>> Here are some of my configuration settings on node 1 (I have already
>>>>>>> tried changing rrp_mode):
>>>>>>>
>>>>>>> *- corosync.conf*
>>>>>>>
>>>>>>>
>>>>>>> totem {
>>>>>>>            version: 2
>>>>>>>            cluster_name: node
>>>>>>>            token: 5000
>>>>>>>            token_retransmits_before_loss_const: 10
>>>>>>>            secauth: off
>>>>>>>            threads: 0
>>>>>>>            rrp_mode: passive
>>>>>>>            nodeid: 1
>>>>>>>            interface {
>>>>>>>                    ringnumber: 0
>>>>>>>                    bindnetaddr: 192.168.0.0
>>>>>>>                    #mcastaddr: 226.94.1.1
>>>>>>>                    mcastport: 5405
>>>>>>>                    broadcast: yes
>>>>>>>            }
>>>>>>>            interface {
>>>>>>>                    ringnumber: 1
>>>>>>>                    bindnetaddr: 192.168.1.0
>>>>>>>                    #mcastaddr: 226.94.1.2
>>>>>>>                    mcastport: 5407
>>>>>>>                    broadcast: yes
>>>>>>>            }
>>>>>>> }
>>>>>>>
>>>>>>> logging {
>>>>>>>            fileline: off
>>>>>>>            to_stderr: yes
>>>>>>>            to_syslog: yes
>>>>>>>            to_logfile: yes
>>>>>>>            logfile: /var/log/corosync/corosync.log
>>>>>>>            debug: off
>>>>>>>            timestamp: on
>>>>>>>            logger_subsys {
>>>>>>>                    subsys: AMF
>>>>>>>                    debug: off
>>>>>>>            }
>>>>>>> }
>>>>>>>
>>>>>>> amf {
>>>>>>>            mode: disabled
>>>>>>> }
>>>>>>>
>>>>>>> quorum {
>>>>>>>            provider: corosync_votequorum
>>>>>>>            expected_votes: 2
>>>>>>> }
>>>>>>>
>>>>>>> nodelist {
>>>>>>>            node {
>>>>>>>                    nodeid: 1
>>>>>>>                    ring0_addr: 192.168.0.1
>>>>>>>                    ring1_addr: 192.168.1.1
>>>>>>>            }
>>>>>>>
>>>>>>>            node {
>>>>>>>                    nodeid: 2
>>>>>>>                    ring0_addr: 192.168.0.2
>>>>>>>                    ring1_addr: 192.168.1.2
>>>>>>>            }
>>>>>>> }
>>>>>>>
>>>>>>> aisexec {
>>>>>>>            user: root
>>>>>>>            group: root
>>>>>>> }
>>>>>>>
>>>>>>> service {
>>>>>>>            name: pacemaker
>>>>>>>            ver: 1
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *- /etc/hosts*
>>>>>>>
>>>>>>>
>>>>>>> 127.0.0.1       localhost
>>>>>>> 10.4.172.5      node1.upc.edu node1
>>>>>>> 10.4.172.6      node2.upc.edu node2
>>>>>>>
>>>>>>>
>>>>>> So the machines have 3 NICs? Two for corosync/cluster traffic and one
>>>>>> for regular traffic/services/the outside world?
>>>>>>
>>>>>>
>>>>>> Thank you for your help in advance!
>>>>>>
>>>>>>>
>>>>>>>
>>>>>> To conclude:
>>>>>> - If you are using NetworkManager, try installing
>>>>>> NetworkManager-config-server; it will probably help.
>>>>>> - If you are brave enough, try corosync 3.x (the current Alpha4 is
>>>>>> pretty stable; actually, some other projects only gain this stability
>>>>>> with SP1 :) ), which has no RRP but uses knet to support redundant
>>>>>> links (up to 8 links can be configured) and doesn't have problems
>>>>>> with ifdown.
>>>>>>
>>>>>> Honza
>>>>>>
>>>>>>
>>>>>>
>>>>>>


-- 
*David Tolosa Martínez*
Customer Support & Infrastructure
UPCnet - Edifici Vèrtex
Plaça d'Eusebi Güell, 6, 08034 Barcelona
Tel: 934054555

<https://www.upcnet.es>
