[ClusterLabs] Redundant ring not recovering after node is back

Thu Aug 23 06:29:14 EDT 2018

I tried installing corosync 3.x and it works pretty well.
But when I install pacemaker, it pulls in the previous version of corosync as a
dependency and breaks the whole setup.
Any suggestions?

2018-08-23 9:32 GMT+02:00 Jan Friesse <jfriesse at redhat.com>:

> David,
>
>> BTW, where can I download Corosync 3.x?
>> I've only seen Corosync 2.99.3 Alpha4 at http://corosync.github.io/corosync/
>>
>
> Yes, that's Alpha 4 of Corosync 3.
>
>
>
>
>> 2018-08-23 9:11 GMT+02:00 David Tolosa <david.tolosa at upcnet.es>:
>>
>> I'm currently using an Ubuntu 18.04 server configuration with netplan.
>>>
>>> Here is my current YAML configuration:
>>>
>>> # This file describes the network interfaces available on your system
>>> # For more information, see netplan(5).
>>> network:
>>>    version: 2
>>>    renderer: networkd
>>>    ethernets:
>>>      eno1:
>>>        addresses: [192.168.0.1/24]
>>>      enp4s0f0:
>>>        addresses: [192.168.1.1/24]
>>>      enp5s0f0:
>>>        {}
>>>    vlans:
>>>      vlan.XXX:
>>>        id: XXX
>>>        link: enp5s0f0
>>>        addresses: [ 10.1.128.5/29 ]
>>>        gateway4: 10.1.128.1
>>>        nameservers:
>>>          addresses: [ 8.8.8.8, 8.8.4.4 ]
>>>          search: [ foo.com, bar.com ]
>>>      vlan.YYY:
>>>        id: YYY
>>>        link: enp5s0f0
>>>        addresses: [ 10.1.128.5/29 ]
>>>
>>>
>>> So, eno1 and enp4s0f0 are the two ethernet ports connected to node2
>>> with crossover cables.
>>> The enp5s0f0 port is used to connect to the outside/services using the
>>> vlans defined in the same file.
>>>
>>> In short, I'm using systemd-networkd, the default Ubuntu 18 server service,
>>> to manage networks.
>>>
>>
> Ok, so systemd-networkd is really doing ifdown, and somebody is actually
> trying to fix it and get the fix merged upstream (sadly without much luck :( )
>
> https://github.com/systemd/systemd/pull/7403
>
>
>>> I'm not finding any NetworkManager-config-server
>>> package in my repository either.
>>>
>>
> I'm not sure what it's called in Debian-based distributions, but it's just
> one small file in /etc, so you can extract it from the RPM.
>
>>> So the only solution that I have left, I suppose, is to test corosync 3.x
>>> and see if it works better handling RRP.
>>>
>>
> You may also consider trying either a completely static network
> configuration or NetworkManager + NetworkManager-config-server.
>
>
> Corosync 3.x with knet will work for sure, but be prepared for quite a
> long compile path, because you first have to compile knet and then
> corosync. What may help you a bit is that we have Ubuntu 18.04 in our
> Jenkins, so it should be possible (corosync build log
> https://ci.kronosnet.org/view/corosync/job/corosync-build-all-voting/lastBuild/corosync-build-all-voting=ubuntu-18-04-lts-x86-64/consoleText,
> knet build log
> https://ci.kronosnet.org/view/knet/job/knet-build-all-voting/lastBuild/knet-build-all-voting=ubuntu-18-04-lts-x86-64/consoleText).
>
> Also please consult http://people.redhat.com/ccaulfie/docs/KnetCorosync.pdf
> about changes in corosync configuration.
>
> Regards,
>   Honza
>
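
As a rough illustration of the configuration changes mentioned above, here is a minimal sketch of what a corosync 3.x corosync.conf with two knet links might look like for the two nodes in this thread. The addresses, cluster name, token and quorum values are carried over from the corosync 2.x configuration quoted below. The node names (node1, node2) and the choice of link_mode: passive are my own assumptions, and I have left out the old rrp_mode, bindnetaddr and broadcast settings because, as far as I understand, they are not used with knet. Treat it as a starting point to check against corosync.conf(5) and the PDF, not as a tested configuration.

```
totem {
        version: 2
        cluster_name: node
        # knet replaces the RRP transport of corosync 2.x
        transport: knet
        # assumed equivalent of the old "rrp_mode: passive"
        link_mode: passive
        token: 5000
        token_retransmits_before_loss_const: 10
}

nodelist {
        node {
                nodeid: 1
                # "name" values are assumptions for this sketch
                name: node1
                ring0_addr: 192.168.0.1
                ring1_addr: 192.168.1.1
        }

        node {
                nodeid: 2
                name: node2
                ring0_addr: 192.168.0.2
                ring1_addr: 192.168.1.2
        }
}

quorum {
        provider: corosync_votequorum
        expected_votes: 2
}
```

Each ringN_addr in the nodelist becomes knet link N, so the two crossover cables map to link 0 and link 1.
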
>
>>> Thank you for your quick response!
>>>
>>> 2018-08-23 8:40 GMT+02:00 Jan Friesse <jfriesse at redhat.com>:
>>>
>>> David,
>>>>
>>>>> Hello,
>>>>>
>>>>> I'm going crazy over this problem, which I hope to resolve here with
>>>>> your help:
>>>>>
>>>>> I have 2 nodes with Corosync redundant ring feature.
>>>>>
>>>>> Each node has 2 similarly connected/configured NICs. Both nodes are
>>>>> connected to each other by two crossover cables.
>>>>>
>>>>>
>>>> I believe this is the root of the problem. Are you using NetworkManager?
>>>> If so, have you installed NetworkManager-config-server? If not, please
>>>> install it and test again.
>>>>
>>>>
>>>>> I configured both nodes with rrp mode passive. Everything works well
>>>>> at this point, but when I shut down one node to test failover and this
>>>>> node comes back online, corosync marks the interface as FAULTY and rrp
>>>>> fails to recover the initial state:
>>>>>
>>>> I believe it's because, with a crossover cable configuration, when the
>>>> other side is shut down, NetworkManager detects it and does an ifdown of
>>>> the interface. And corosync is unable to handle ifdown properly. Ifdown is
>>>> bad with a single ring, but it's a real killer with RRP (127.0.0.1 poisons
>>>> every node in the cluster).
>>>>
>>>>>
>>>>> 1. Initial scenario:
>>>>>
>>>>> # corosync-cfgtool -s
>>>>> Printing ring status.
>>>>> Local node ID 1
>>>>> RING ID 0
>>>>>           id      = 192.168.0.1
>>>>>           status  = ring 0 active with no faults
>>>>> RING ID 1
>>>>>           id      = 192.168.1.1
>>>>>           status  = ring 1 active with no faults
>>>>>
>>>>>
>>>>> 2. When I shut down node 2, everything continues with no faults.
>>>>> Sometimes the ring IDs bind to 127.0.0.1 and then bind back to their
>>>>> respective heartbeat IPs.
>>>>>
>>>>>
>>>> Again, result of ifdown.
>>>>
>>>>
>>>>> 3. When node 2 is back online:
>>>>>
>>>>> # corosync-cfgtool -s
>>>>> Printing ring status.
>>>>> Local node ID 1
>>>>> RING ID 0
>>>>>           id      = 192.168.0.1
>>>>>           status  = ring 0 active with no faults
>>>>> RING ID 1
>>>>>           id      = 192.168.1.1
>>>>>           status  = Marking ringid 1 interface 192.168.1.1 FAULTY
>>>>>
>>>>>
>>>>> # service corosync status
>>>>> ● corosync.service - Corosync Cluster Engine
>>>>>      Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
>>>>>      Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s ago
>>>>>        Docs: man:corosync
>>>>>              man:corosync.conf
>>>>>              man:corosync_overview
>>>>>    Main PID: 1439 (corosync)
>>>>>       Tasks: 2 (limit: 4915)
>>>>>      CGroup: /system.slice/corosync.service
>>>>>              └─1439 /usr/sbin/corosync -f
>>>>>
>>>>>
>>>>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The network interface [192.168.0.1] is now up.
>>>>> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface [192.168.0.1] is now up.
>>>>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The network interface [192.168.1.1] is now up.
>>>>> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface [192.168.1.1] is now up.
>>>>> Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ] A new membership (192.168.0.1:601760) was formed. Members
>>>>> Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (192.168.0.1:601760) was formed. Members
>>>>> Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ] A new membership (192.168.0.1:601764) was formed. Members joined: 2
>>>>> Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (192.168.0.1:601764) was formed. Members joined: 2
>>>>> Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error   [TOTEM ] Marking ringid 1 interface 192.168.1.1 FAULTY
>>>>> Aug 22 14:44:34 node1 corosync[1439]:   [TOTEM ] Marking ringid 1 interface 192.168.1.1 FAULTY
>>>>>
>>>>>
>>>>> If I run corosync-cfgtool, it clears the faulty error, but after a few
>>>>> seconds the ring returns to FAULTY.
>>>>> The only thing that resolves the problem is restarting the service
>>>>> with service corosync restart.
>>>>>
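
To watch for the ring going FAULTY again without reading the status by hand, the output of corosync-cfgtool -s can be scanned with a few lines of Python. This is only a sketch: the function name faulty_rings and the SAMPLE text are mine (the sample is copied from the output quoted above), and the parsing assumes the corosync 2.x output layout shown in this thread.

```python
import re


def faulty_rings(cfgtool_output):
    """Return the ring IDs whose status line mentions FAULTY."""
    faulty = []
    current_ring = None
    for raw_line in cfgtool_output.splitlines():
        line = raw_line.strip()
        match = re.match(r"RING ID (\d+)$", line)
        if match:
            # Remember which ring the following id/status lines belong to.
            current_ring = int(match.group(1))
        elif (line.startswith("status") and "FAULTY" in line
              and current_ring is not None):
            faulty.append(current_ring)
    return faulty


# Sample copied from the "corosync-cfgtool -s" output in this thread.
SAMPLE = """\
Printing ring status.
Local node ID 1
RING ID 0
        id      = 192.168.0.1
        status  = ring 0 active with no faults
RING ID 1
        id      = 192.168.1.1
        status  = Marking ringid 1 interface 192.168.1.1 FAULTY
"""

if __name__ == "__main__":
    print(faulty_rings(SAMPLE))
```

On a live node the text could come from subprocess.run(["corosync-cfgtool", "-s"], capture_output=True, text=True).stdout instead of SAMPLE.
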
>>>>> Here are some of my configuration settings on node 1 (I have already
>>>>> tried changing rrp_mode):
>>>>>
>>>>> *- corosync.conf*
>>>>>
>>>>>
>>>>> totem {
>>>>>           version: 2
>>>>>           cluster_name: node
>>>>>           token: 5000
>>>>>           token_retransmits_before_loss_const: 10
>>>>>           secauth: off
>>>>>           threads: 0
>>>>>           rrp_mode: passive
>>>>>           nodeid: 1
>>>>>           interface {
>>>>>                   ringnumber: 0
>>>>>                   bindnetaddr: 192.168.0.0
>>>>>                   #mcastaddr: 226.94.1.1
>>>>>                   mcastport: 5405
>>>>>                   broadcast: yes
>>>>>           }
>>>>>           interface {
>>>>>                   ringnumber: 1
>>>>>                   bindnetaddr: 192.168.1.0
>>>>>                   #mcastaddr: 226.94.1.2
>>>>>                   mcastport: 5407
>>>>>                   broadcast: yes
>>>>>           }
>>>>> }
>>>>>
>>>>> logging {
>>>>>           fileline: off
>>>>>           to_stderr: yes
>>>>>           to_syslog: yes
>>>>>           to_logfile: yes
>>>>>           logfile: /var/log/corosync/corosync.log
>>>>>           debug: off
>>>>>           timestamp: on
>>>>>           logger_subsys {
>>>>>                   subsys: AMF
>>>>>                   debug: off
>>>>>           }
>>>>> }
>>>>>
>>>>> amf {
>>>>>           mode: disabled
>>>>> }
>>>>>
>>>>> quorum {
>>>>>           provider: corosync_votequorum
>>>>>           expected_votes: 2
>>>>> }
>>>>>
>>>>> nodelist {
>>>>>           node {
>>>>>                   nodeid: 1
>>>>>                   ring0_addr: 192.168.0.1
>>>>>                   ring1_addr: 192.168.1.1
>>>>>           }
>>>>>
>>>>>           node {
>>>>>                   nodeid: 2
>>>>>                   ring0_addr: 192.168.0.2
>>>>>                   ring1_addr: 192.168.1.2
>>>>>           }
>>>>> }
>>>>>
>>>>> aisexec {
>>>>>           user: root
>>>>>           group: root
>>>>> }
>>>>>
>>>>> service {
>>>>>           name: pacemaker
>>>>>           ver: 1
>>>>> }
>>>>>
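
As a quick sanity check of the configuration above, the ring addresses in the nodelist can be compared against the network each ring binds to. The sketch below uses Python's standard ipaddress module. The /24 prefix length is an assumption taken from the node 1 netplan addresses (192.168.0.1/24 and 192.168.1.1/24), node 2 is assumed to use the same prefix, and the names RINGS, NODES and misplaced_addresses are hypothetical.

```python
import ipaddress

# bindnetaddr per ring, with the /24 prefix assumed from the netplan file.
RINGS = {0: "192.168.0.0/24", 1: "192.168.1.0/24"}

# ringN_addr values per nodeid, copied from the nodelist section.
NODES = {
    1: {0: "192.168.0.1", 1: "192.168.1.1"},
    2: {0: "192.168.0.2", 1: "192.168.1.2"},
}


def misplaced_addresses(rings, nodes):
    """Return (nodeid, ring, address) for addresses outside their ring's network."""
    problems = []
    for nodeid, addrs in nodes.items():
        for ring, addr in addrs.items():
            network = ipaddress.ip_network(rings[ring])
            if ipaddress.ip_address(addr) not in network:
                problems.append((nodeid, ring, addr))
    return problems


if __name__ == "__main__":
    print(misplaced_addresses(RINGS, NODES))
```

An empty list means every ring address sits inside the network its ring binds to, which is the case for the values quoted here.
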
>>>>>
>>>>>
>>>>> *- /etc/hosts*
>>>>>
>>>>>
>>>>> 127.0.0.1       localhost
>>>>> 10.4.172.5      node1.upc.edu node1
>>>>> 10.4.172.6      node2.upc.edu node2
>>>>>
>>>>>
>>>> So the machines have 3 NICs? 2 for corosync/cluster traffic and one for
>>>> regular traffic/services/outside world?
>>>>
>>>>
>>>>> Thank you for your help in advance!
>>>>>
>>>>>
>>>> To conclude:
>>>> - If you are using NetworkManager, try installing
>>>> NetworkManager-config-server; it will probably help.
>>>> - If you are brave enough, try corosync 3.x (the current Alpha4 is pretty
>>>> stable - actually some other projects only gain this stability with SP1 :) ).
>>>> It has no RRP but uses knet to support redundant links (up to 8 links
>>>> can be configured) and doesn't have problems with ifdown.
>>>>
>>>> Honza
>>>>
>>>>
>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Users mailing list: Users at clusterlabs.org
>>>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>>>
>>>>> Project Home: http://www.clusterlabs.org
>>>>> Getting started: http://www.clusterlabs.org/doc
>>>>> /Cluster_from_Scratch.pdf
>>>>> Bugs: http://bugs.clusterlabs.org
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>> --
>>> *David Tolosa Martínez*
>>> Customer Support & Infrastructure
>>> UPCnet - Edifici Vèrtex
>>> Plaça d'Eusebi Güell, 6, 08034 Barcelona
>>> Tel: 934054555
>>>
>>> <https://www.upcnet.es>
>>>
>>>
>>
>>
>>
>

-- 
*David Tolosa Martínez*
Customer Support & Infrastructure
UPCnet - Edifici Vèrtex
Plaça d'Eusebi Güell, 6, 08034 Barcelona
Tel: 934054555

<https://www.upcnet.es>
