[ClusterLabs] Redundant ring not recovering after node is back

Wed Aug 22 08:53:25 EDT 2018

Hello,
Im getting crazy about this problem, that I expect to resolve here, with
your help guys:

I have 2 nodes with Corosync redundant ring feature.

Each node has 2 similarly connected/configured NIC's. Both nodes are
connected each other by two crossover cables.

I configured both nodes with rrp mode passive. Everything is working well
at this point, but when I shutdown 1 node to test failover, and this node
returns to be online, corosync is marking the interface as FAULTY and rrp
fails to recover the initial state:

1. Initial scenario:

# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
        id      = 192.168.0.1
        status  = ring 0 active with no faults
RING ID 1
        id      = 192.168.1.1
        status  = ring 1 active with no faults

2. When I shutdown the node 2, all continues with no faults. Sometimes the
ring ID's are bonding with 127.0.0.1 and then bond back to their respective
heartbeat IP.

3. When node 2 is back online:

# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
        id      = 192.168.0.1
        status  = ring 0 active with no faults
RING ID 1
        id      = 192.168.1.1
        status  = Marking ringid 1 interface 192.168.1.1 FAULTY

# service corosync status
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor
preset: enabled)
   Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
 Main PID: 1439 (corosync)
    Tasks: 2 (limit: 4915)
   CGroup: /system.slice/corosync.service
           └─1439 /usr/sbin/corosync -f

Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
network interface [192.168.0.1] is now up.
Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
[192.168.0.1] is now up.
Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
network interface [192.168.1.1] is now up.
Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
[192.168.1.1] is now up.
Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ] A
new membership (192.168.0.1:601760) was formed. Members
Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (
192.168.0.1:601760) was formed. Members
Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ] A
new membership (192.168.0.1:601764) was formed. Members joined: 2
Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (
192.168.0.1:601764) was formed. Members joined: 2
Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error   [TOTEM ]
Marking ringid 1 interface 192.168.1.1 FAULTY
Aug 22 14:44:34 node1 corosync[1439]:   [TOTEM ] Marking ringid 1 interface
192.168.1.1 FAULTY

If I execute corosync-cfgtool, clears the faulty error but after some
seconds return to be FAULTY.
The only thing that it resolves the problem is to restart de service with
service corosync restart.

Here you have some of my configuration settings on node 1 (I probed already
to change rrp_mode):

*- corosync.conf*

totem {
        version: 2
        cluster_name: node
        token: 5000
        token_retransmits_before_loss_const: 10
        secauth: off
        threads: 0
        rrp_mode: passive
        nodeid: 1
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.0.0
                #mcastaddr: 226.94.1.1
                mcastport: 5405
                broadcast: yes
        }
        interface {
                ringnumber: 1
                bindnetaddr: 192.168.1.0
                #mcastaddr: 226.94.1.2
                mcastport: 5407
                broadcast: yes
        }
}

logging {
        fileline: off
        to_stderr: yes
        to_syslog: yes
        to_logfile: yes
        logfile: /var/log/corosync/corosync.log
        debug: off
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
}

amf {
        mode: disabled
}

quorum {
        provider: corosync_votequorum
        expected_votes: 2
}

nodelist {
        node {
                nodeid: 1
                ring0_addr: 192.168.0.1
                ring1_addr: 192.168.1.1
        }

        node {
                nodeid: 2
                ring0_addr: 192.168.0.2
                ring1_addr: 192.168.1.2
        }
}

aisexec {
        user: root
        group: root
}

service {
        name: pacemaker
        ver: 1
}

*- /etc/hosts*

127.0.0.1       localhost
10.4.172.5      node1.upc.edu node1
10.4.172.6      node2.upc.edu node2

Thank you for you help in advance!

-- 
*David Tolosa Martínez*
Customer Support & Infrastructure
UPCnet - Edifici Vèrtex
Plaça d'Eusebi Güell, 6, 08034 Barcelona
Tel: 934054555

<https://www.upcnet.es>

-- 

INFORMACIÓ BÀSICA SOBRE PROTECCIÓ DE DADES:

Responsable: UPCNET, 
Serveis d'Accés a Internet de la Universitat Politècnica de Catalunya, SLU  
 |   Finalitat: gestionar els contactes i les relacions professionals i 
comercials amb els nostres clients i proveïdors   |   Base legal: 
consentiment, interès legítim i/o relació contractual   |   Destinataris: 
no seran comunicades a tercers excepte per obligació legal   |   Drets: 
pots exercir els teus drets d’accés, rectificació i supressió, així com els 
altres drets reconeguts a la normativa vigent, enviant-nos un missatge a 
privacy at upcnet.es <mailto:privacy at upcnet.es>   |   Més informació: consulta 
la nostra política completa de protecció de dades 
<https://www.upcnet.es/politica-de-privacitat>.

AVÍS DE 
CONFIDENCIALITAT <https://www.upcnet.es/ca/avis-de-confidencialitat>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20180822/a8b13700/attachment.html>