[ClusterLabs] Redundant ring not recovering after node is back

David Tolosa david.tolosa at upcnet.es
Thu Aug 23 07:15:52 UTC 2018


BTW, where can I download Corosync 3.x?
I've only seen Corosync 2.99.3 Alpha4 at http://corosync.github.io/corosync/

2018-08-23 9:11 GMT+02:00 David Tolosa <david.tolosa at upcnet.es>:

> I'm currently using an Ubuntu 18.04 server configuration with netplan.
>
> Here you have my current YAML configuration:
>
> # This file describes the network interfaces available on your system
> # For more information, see netplan(5).
> network:
>   version: 2
>   renderer: networkd
>   ethernets:
>     eno1:
>       addresses: [192.168.0.1/24]
>     enp4s0f0:
>       addresses: [192.168.1.1/24]
>     enp5s0f0:
>       {}
>   vlans:
>     vlan.XXX:
>       id: XXX
>       link: enp5s0f0
>       addresses: [ 10.1.128.5/29 ]
>       gateway4: 10.1.128.1
>       nameservers:
>         addresses: [ 8.8.8.8, 8.8.4.4 ]
>         search: [ foo.com, bar.com ]
>     vlan.YYY:
>       id: YYY
>       link: enp5s0f0
>       addresses: [ 10.1.128.5/29 ]
>
>
> So, eno1 and enp4s0f0 are the two Ethernet ports connected directly to
> node2 with crossover cables.
> The enp5s0f0 port is used to reach outside services over the VLANs
> defined in the same file.
>
> In short, I'm using systemd-networkd, the default service on Ubuntu 18.04
> server, to manage the network. I can't find any NetworkManager-config-server
> package in my repositories either.
> So the only option I have left, I suppose, is to test Corosync 3.x and see
> whether it handles link redundancy better than RRP.
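Since the interfaces here are managed by systemd-networkd rather than NetworkManager, the analogous mitigation would be to tell networkd to keep the addresses configured when the crossover link loses carrier, so the peer's shutdown never turns into an ifdown on this side. A sketch only (the interface name and address are taken from the netplan config above; whether these options are available depends on the systemd version shipped with Ubuntu 18.04):

```ini
# /etc/systemd/network/10-eno1.network -- sketch; in a netplan setup this
# would normally be merged into a drop-in alongside the generated unit
[Match]
Name=eno1

[Network]
Address=192.168.0.1/24
# Bring the interface up even without carrier, and keep its addresses
# and routes when carrier is lost (option names per systemd.network(5))
ConfigureWithoutCarrier=yes
IgnoreCarrierLoss=yes
```

Newer netplan releases also seem to expose this directly as `ignore-carrier: true` under the interface, which would avoid hand-written .network files.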
>
> Thank you for your quick response!
>
> 2018-08-23 8:40 GMT+02:00 Jan Friesse <jfriesse at redhat.com>:
>
>> David,
>>
>>> Hello,
>>> This problem is driving me crazy, and I hope to resolve it here with
>>> your help, guys:
>>>
>>> I have 2 nodes using Corosync's redundant ring feature.
>>>
>>> Each node has 2 similarly connected/configured NICs. The nodes are
>>> connected to each other by two crossover cables.
>>>
>>
>> I believe this is the root of the problem. Are you using NetworkManager? If
>> so, have you installed NetworkManager-config-server? If not, please install
>> it and test again.
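If it helps, NetworkManager-config-server is a tiny package that just ships a configuration drop-in roughly like the following (the path and exact contents vary between distributions, so treat this as a sketch; an equivalent file can be created by hand where the package is unavailable):

```ini
# /usr/lib/NetworkManager/conf.d/00-server.conf (typical contents)
[main]
# Don't create automatic default connections for unconfigured devices
no-auto-default=*
# Keep interfaces configured even when they lose carrier, so a peer
# shutdown on a crossover cable does not trigger an ifdown
ignore-carrier=*
```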
>>
>>
>>> I configured both nodes with rrp_mode passive. Everything is working well
>>> at this point, but when I shut down 1 node to test failover, and this
>>> node comes back online, corosync marks the interface as FAULTY and RRP
>>>
>>
>> I believe it's because, with a crossover-cable configuration, when the
>> other side is shut down, NetworkManager detects the lost carrier and does
>> an ifdown of the interface. And corosync is unable to handle ifdown
>> properly. Ifdown is bad with a single ring, but it's a real killer with
>> RRP (127.0.0.1 poisons every node in the cluster).
>>
>> fails to recover the initial state:
>>>
>>> 1. Initial scenario:
>>>
>>> # corosync-cfgtool -s
>>> Printing ring status.
>>> Local node ID 1
>>> RING ID 0
>>>          id      = 192.168.0.1
>>>          status  = ring 0 active with no faults
>>> RING ID 1
>>>          id      = 192.168.1.1
>>>          status  = ring 1 active with no faults
>>>
>>>
>>> 2. When I shut down node 2, everything continues with no faults.
>>> Sometimes the ring IDs bind to 127.0.0.1 and then bind back to their
>>> respective heartbeat IPs.
>>>
>>
>> Again, result of ifdown.
>>
>>
>>> 3. When node 2 is back online:
>>>
>>> # corosync-cfgtool -s
>>> Printing ring status.
>>> Local node ID 1
>>> RING ID 0
>>>          id      = 192.168.0.1
>>>          status  = ring 0 active with no faults
>>> RING ID 1
>>>          id      = 192.168.1.1
>>>          status  = Marking ringid 1 interface 192.168.1.1 FAULTY
>>>
>>>
>>> # service corosync status
>>> ● corosync.service - Corosync Cluster Engine
>>>     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
>>>     Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s ago
>>>       Docs: man:corosync
>>>             man:corosync.conf
>>>             man:corosync_overview
>>>   Main PID: 1439 (corosync)
>>>      Tasks: 2 (limit: 4915)
>>>     CGroup: /system.slice/corosync.service
>>>             └─1439 /usr/sbin/corosync -f
>>>
>>>
>>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The network interface [192.168.0.1] is now up.
>>> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface [192.168.0.1] is now up.
>>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The network interface [192.168.1.1] is now up.
>>> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface [192.168.1.1] is now up.
>>> Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ] A new membership (192.168.0.1:601760) was formed. Members
>>> Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (192.168.0.1:601760) was formed. Members
>>> Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ] A new membership (192.168.0.1:601764) was formed. Members joined: 2
>>> Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (192.168.0.1:601764) was formed. Members joined: 2
>>> Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error   [TOTEM ] Marking ringid 1 interface 192.168.1.1 FAULTY
>>> Aug 22 14:44:34 node1 corosync[1439]:   [TOTEM ] Marking ringid 1 interface 192.168.1.1 FAULTY
>>>
>>>
>>> If I execute corosync-cfgtool, it clears the FAULTY state, but after a
>>> few seconds the ring becomes FAULTY again.
>>> The only thing that resolves the problem is to restart the service with
>>> service corosync restart.
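For the record, the cfgtool invocations involved here would look roughly like this (a sketch; these commands act on a live cluster, and the reset merely clears the FAULTY flag rather than fixing the underlying ifdown problem):

```
# Show current ring status
corosync-cfgtool -s
# Re-enable rings marked FAULTY, cluster-wide (see corosync-cfgtool(8));
# with the ifdown problem present, the ring soon goes FAULTY again
corosync-cfgtool -r
```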
>>>
>>> Here are some of my configuration settings on node 1 (I have already
>>> tried changing rrp_mode):
>>>
>>> *- corosync.conf*
>>>
>>>
>>> totem {
>>>          version: 2
>>>          cluster_name: node
>>>          token: 5000
>>>          token_retransmits_before_loss_const: 10
>>>          secauth: off
>>>          threads: 0
>>>          rrp_mode: passive
>>>          nodeid: 1
>>>          interface {
>>>                  ringnumber: 0
>>>                  bindnetaddr: 192.168.0.0
>>>                  #mcastaddr: 226.94.1.1
>>>                  mcastport: 5405
>>>                  broadcast: yes
>>>          }
>>>          interface {
>>>                  ringnumber: 1
>>>                  bindnetaddr: 192.168.1.0
>>>                  #mcastaddr: 226.94.1.2
>>>                  mcastport: 5407
>>>                  broadcast: yes
>>>          }
>>> }
>>>
>>> logging {
>>>          fileline: off
>>>          to_stderr: yes
>>>          to_syslog: yes
>>>          to_logfile: yes
>>>          logfile: /var/log/corosync/corosync.log
>>>          debug: off
>>>          timestamp: on
>>>          logger_subsys {
>>>                  subsys: AMF
>>>                  debug: off
>>>          }
>>> }
>>>
>>> amf {
>>>          mode: disabled
>>> }
>>>
>>> quorum {
>>>          provider: corosync_votequorum
>>>          expected_votes: 2
>>> }
>>>
>>> nodelist {
>>>          node {
>>>                  nodeid: 1
>>>                  ring0_addr: 192.168.0.1
>>>                  ring1_addr: 192.168.1.1
>>>          }
>>>
>>>          node {
>>>                  nodeid: 2
>>>                  ring0_addr: 192.168.0.2
>>>                  ring1_addr: 192.168.1.2
>>>          }
>>> }
>>>
>>> aisexec {
>>>          user: root
>>>          group: root
>>> }
>>>
>>> service {
>>>          name: pacemaker
>>>          ver: 1
>>> }
>>>
>>>
>>>
>>> *- /etc/hosts*
>>>
>>>
>>> 127.0.0.1       localhost
>>> 10.4.172.5      node1.upc.edu node1
>>> 10.4.172.6      node2.upc.edu node2
>>>
>>>
>> So the machines have 3 NICs? Two for corosync/cluster traffic and one for
>> regular traffic/services/the outside world?
>>
>>
>>> Thank you for you help in advance!
>>>
>>
>> To conclude:
>> - If you are using NetworkManager, try installing
>> NetworkManager-config-server; it will probably help.
>> - If you are brave enough, try corosync 3.x (the current Alpha4 is pretty
>> stable - actually, some other projects only gain this stability with SP1
>> :) ). It has no RRP but uses knet to support redundant links (up to 8
>> links can be configured), and it doesn't have problems with ifdown.
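For anyone tempted by that route, a corosync 3.x equivalent of the configuration quoted above might look roughly like the sketch below. This is an assumption-laden outline, not a tested config: option names follow the corosync 3 man pages, knet is the default transport there, link_mode replaces rrp_mode, and the redundant links are derived from the ringX_addr entries in the nodelist rather than from interface{} blocks:

```
totem {
        version: 2
        cluster_name: node
        token: 5000
        # knet replaces RRP; link_mode takes passive/active/rr
        transport: knet
        link_mode: passive
        crypto_cipher: none
        crypto_hash: none
}

nodelist {
        node {
                nodeid: 1
                # one ringX_addr per redundant link, up to 8 with knet
                ring0_addr: 192.168.0.1
                ring1_addr: 192.168.1.1
        }
        node {
                nodeid: 2
                ring0_addr: 192.168.0.2
                ring1_addr: 192.168.1.2
        }
}

quorum {
        provider: corosync_votequorum
        # two_node is the usual votequorum setting for 2-node clusters
        two_node: 1
}
```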
>>
>> Honza
>>
>>
>>>
>>>
>>> _______________________________________________
>>> Users mailing list: Users at clusterlabs.org
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>
>>>
>>
>
>
> --
> *David Tolosa Martínez*
> Customer Support & Infrastructure
> UPCnet - Edifici Vèrtex
> Plaça d'Eusebi Güell, 6, 08034 Barcelona
> Tel: 934054555
>
> <https://www.upcnet.es>
>



