[ClusterLabs] Redundant ring not recovering after node is back

David Tolosa david.tolosa at upcnet.es
Thu Aug 23 03:11:51 EDT 2018


I'm currently using an Ubuntu 18.04 server configuration with netplan.

Here you have my current YAML configuration:

# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      addresses: [192.168.0.1/24]
    enp4s0f0:
      addresses: [192.168.1.1/24]
    enp5s0f0:
      {}
  vlans:
    vlan.XXX:
      id: XXX
      link: enp5s0f0
      addresses: [ 10.1.128.5/29 ]
      gateway4: 10.1.128.1
      nameservers:
        addresses: [ 8.8.8.8, 8.8.4.4 ]
        search: [ foo.com, bar.com ]
    vlan.YYY:
      id: YYY
      link: enp5s0f0
      addresses: [ 10.1.128.5/29 ]


So eno1 and enp4s0f0 are the two Ethernet ports connected directly to node2
with crossover cables.
The enp5s0f0 port is used to reach the outside world/services via the VLANs
defined in the same file.

In short, I'm using systemd-networkd, the default network management service
on Ubuntu 18.04 server. I also can't find any NetworkManager-config-server
package in my repositories.
So the only option I have left, I suppose, is to test corosync 3.x and see
whether it handles redundant rings better.
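
Before going that far, one idea I may still try on the networkd side, as a
rough sketch I have not yet verified on 18.04: the closest equivalent to
NetworkManager-config-server would be telling systemd-networkd to keep the
static address configured even when the crossover link loses carrier, so
corosync never sees the address disappear. Something like the following
(the file name is hypothetical, it would need to take precedence over the
netplan-generated file for eno1, and the same would be needed for enp4s0f0;
ConfigureWithoutCarrier= should exist in 18.04's systemd but is worth
double-checking):

# /etc/systemd/network/10-eno1-keepaddr.network  (illustrative)
[Match]
Name=eno1

[Network]
Address=192.168.0.1/24
# Keep the address assigned even when the peer node is down and the
# link has no carrier.
ConfigureWithoutCarrier=true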

Thank you for your quick response!

2018-08-23 8:40 GMT+02:00 Jan Friesse <jfriesse at redhat.com>:

> David,
>
> Hello,
>> I'm going crazy over this problem, which I hope to resolve here with
>> your help:
>>
>> I have 2 nodes using Corosync's redundant ring (RRP) feature.
>>
>> Each node has 2 similarly connected/configured NICs. The two nodes are
>> connected to each other by two crossover cables.
>>
>
> I believe this is the root of the problem. Are you using NetworkManager? If
> so, have you installed NetworkManager-config-server? If not, please install
> it and test again.
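>
> (For context, a sketch of what that package effectively does: it ships a
> small NetworkManager configuration snippet roughly equivalent to the
> following, telling NM not to touch interfaces when they lose carrier. The
> exact file path/name below is illustrative.)
>
> # /etc/NetworkManager/conf.d/00-server.conf  (illustrative)
> [main]
> # don't auto-create DHCP connections on unconfigured NICs
> no-auto-default=*
> # don't tear a connection down just because the link lost carrier
> ignore-carrier=*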
>
>
>> I configured both nodes with rrp_mode passive. Everything works well at
>> this point, but when I shut down one node to test failover and that node
>> comes back online, corosync marks the interface as FAULTY and RRP
>>
>
> I believe it's because, with a crossover-cable configuration, when the other
> side is shut down, NetworkManager detects the lost carrier and does an ifdown
> of the interface. And corosync is unable to handle ifdown properly. Ifdown is
> bad with a single ring, but it's a killer with RRP (127.0.0.1 poisons every
> node in the cluster).
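>
> (A quick sketch of how that poisoning shows up: a ring that has been rebound
> to loopback is visible in the ring status, e.g.
>
>     # flag any ring that corosync has rebound to 127.0.0.1 after an ifdown
>     corosync-cfgtool -s | grep '127\.0\.0\.1'
> )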
>
>> fails to recover the initial state:
>>
>> 1. Initial scenario:
>>
>> # corosync-cfgtool -s
>> Printing ring status.
>> Local node ID 1
>> RING ID 0
>>          id      = 192.168.0.1
>>          status  = ring 0 active with no faults
>> RING ID 1
>>          id      = 192.168.1.1
>>          status  = ring 1 active with no faults
>>
>>
>> 2. When I shut down node 2, everything continues with no faults. Sometimes
>> the ring IDs bind to 127.0.0.1 and then bind back to their respective
>> heartbeat IPs.
>>
>
> Again, result of ifdown.
>
>
>> 3. When node 2 is back online:
>>
>> # corosync-cfgtool -s
>> Printing ring status.
>> Local node ID 1
>> RING ID 0
>>          id      = 192.168.0.1
>>          status  = ring 0 active with no faults
>> RING ID 1
>>          id      = 192.168.1.1
>>          status  = Marking ringid 1 interface 192.168.1.1 FAULTY
>>
>>
>> # service corosync status
>> ● corosync.service - Corosync Cluster Engine
>>     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor
>> preset: enabled)
>>     Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s
>> ago
>>       Docs: man:corosync
>>             man:corosync.conf
>>             man:corosync_overview
>>   Main PID: 1439 (corosync)
>>      Tasks: 2 (limit: 4915)
>>     CGroup: /system.slice/corosync.service
>>             └─1439 /usr/sbin/corosync -f
>>
>>
>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
>> network interface [192.168.0.1] is now up.
>> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
>> [192.168.0.1] is now up.
>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
>> network interface [192.168.1.1] is now up.
>> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
>> [192.168.1.1] is now up.
>> Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ] A
>> new membership (192.168.0.1:601760) was formed. Members
>> Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (
>> 192.168.0.1:601760) was formed. Members
>> Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ] A
>> new membership (192.168.0.1:601764) was formed. Members joined: 2
>> Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (
>> 192.168.0.1:601764) was formed. Members joined: 2
>> Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error   [TOTEM ]
>> Marking ringid 1 interface 192.168.1.1 FAULTY
>> Aug 22 14:44:34 node1 corosync[1439]:   [TOTEM ] Marking ringid 1
>> interface
>> 192.168.1.1 FAULTY
>>
>>
>> If I run corosync-cfgtool, it clears the faulty state, but after a few
>> seconds it becomes FAULTY again.
>> The only thing that resolves the problem is restarting the service with
>> service corosync restart.
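>>
>> (That is, clearing the fault with something like:
>>
>>     # re-enable a ring marked FAULTY (clears the fault cluster-wide)
>>     corosync-cfgtool -r
>>     # a few seconds later, ring 1 is marked FAULTY again
>>     corosync-cfgtool -s
>> )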
>>
>> Here are some of my configuration settings on node 1 (I already tried
>> changing rrp_mode):
>>
>> *- corosync.conf*
>>
>>
>> totem {
>>          version: 2
>>          cluster_name: node
>>          token: 5000
>>          token_retransmits_before_loss_const: 10
>>          secauth: off
>>          threads: 0
>>          rrp_mode: passive
>>          nodeid: 1
>>          interface {
>>                  ringnumber: 0
>>                  bindnetaddr: 192.168.0.0
>>                  #mcastaddr: 226.94.1.1
>>                  mcastport: 5405
>>                  broadcast: yes
>>          }
>>          interface {
>>                  ringnumber: 1
>>                  bindnetaddr: 192.168.1.0
>>                  #mcastaddr: 226.94.1.2
>>                  mcastport: 5407
>>                  broadcast: yes
>>          }
>> }
>>
>> logging {
>>          fileline: off
>>          to_stderr: yes
>>          to_syslog: yes
>>          to_logfile: yes
>>          logfile: /var/log/corosync/corosync.log
>>          debug: off
>>          timestamp: on
>>          logger_subsys {
>>                  subsys: AMF
>>                  debug: off
>>          }
>> }
>>
>> amf {
>>          mode: disabled
>> }
>>
>> quorum {
>>          provider: corosync_votequorum
>>          expected_votes: 2
>> }
>>
>> nodelist {
>>          node {
>>                  nodeid: 1
>>                  ring0_addr: 192.168.0.1
>>                  ring1_addr: 192.168.1.1
>>          }
>>
>>          node {
>>                  nodeid: 2
>>                  ring0_addr: 192.168.0.2
>>                  ring1_addr: 192.168.1.2
>>          }
>> }
>>
>> aisexec {
>>          user: root
>>          group: root
>> }
>>
>> service {
>>          name: pacemaker
>>          ver: 1
>> }
>>
>>
>>
>> *- /etc/hosts*
>>
>>
>> 127.0.0.1       localhost
>> 10.4.172.5      node1.upc.edu node1
>> 10.4.172.6      node2.upc.edu node2
>>
>>
> So the machines have 3 NICs? Two for corosync/cluster traffic and one for
> regular traffic/services/the outside world?
>
>
>> Thank you for you help in advance!
>>
>
> To conclude:
> - If you are using NetworkManager, try installing
> NetworkManager-config-server; it will probably help.
> - If you are brave enough, try corosync 3.x (the current Alpha4 is pretty
> stable - actually, some other projects only reach this stability with SP1 :) ).
> It has no RRP but uses knet to support redundant links (up to 8 links
> can be configured), and it does not have problems with ifdown.
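>
> For illustration only, not a drop-in replacement for your config, a rough
> corosync 3.x sketch of the same two links over knet would look something
> like this (quorum/logging sections omitted):
>
> totem {
>         version: 2
>         cluster_name: node
>         transport: knet
>         # knet's rough equivalent of rrp_mode: passive
>         link_mode: passive
> }
>
> nodelist {
>         node {
>                 nodeid: 1
>                 ring0_addr: 192.168.0.1
>                 ring1_addr: 192.168.1.1
>         }
>         node {
>                 nodeid: 2
>                 ring0_addr: 192.168.0.2
>                 ring1_addr: 192.168.1.2
>         }
> }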
>
> Honza
>
>
>


-- 
*David Tolosa Martínez*
Customer Support & Infrastructure
UPCnet - Edifici Vèrtex
Plaça d'Eusebi Güell, 6, 08034 Barcelona
Tel: 934054555

<https://www.upcnet.es>
