[ClusterLabs] Redundant ring not recovering after node is back

Thu Aug 23 04:13:32 UTC 2018

22.08.2018 15:53, David Tolosa wrote:
> Hello,
> I'm going crazy over this problem, which I hope to resolve here with
> your help:
> 
> I have 2 nodes using the Corosync redundant ring feature.
> 
> Each node has 2 similarly connected/configured NICs. The nodes are
> connected to each other by two crossover cables.
> 
> I configured both nodes with rrp_mode passive. Everything works well at
> this point, but when I shut down 1 node to test failover and that node
> comes back online, corosync marks the interface as FAULTY and RRP fails
> to recover the initial state:
> 
> 1. Initial scenario:
> 
> # corosync-cfgtool -s
> Printing ring status.
> Local node ID 1
> RING ID 0
>         id      = 192.168.0.1
>         status  = ring 0 active with no faults
> RING ID 1
>         id      = 192.168.1.1
>         status  = ring 1 active with no faults
> 
> 
> 2. When I shut down node 2, everything continues with no faults. Sometimes
> the ring IDs bind to 127.0.0.1 and then bind back to their respective
> heartbeat IPs.
> 
> 3. When node 2 is back online:
> 
> # corosync-cfgtool -s
> Printing ring status.
> Local node ID 1
> RING ID 0
>         id      = 192.168.0.1
>         status  = ring 0 active with no faults
> RING ID 1
>         id      = 192.168.1.1
>         status  = Marking ringid 1 interface 192.168.1.1 FAULTY
> 
> 
> # service corosync status
> ● corosync.service - Corosync Cluster Engine
>    Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor
> preset: enabled)
>    Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s ago
>      Docs: man:corosync
>            man:corosync.conf
>            man:corosync_overview
>  Main PID: 1439 (corosync)
>     Tasks: 2 (limit: 4915)
>    CGroup: /system.slice/corosync.service
>            └─1439 /usr/sbin/corosync -f
> 
> 
> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
> network interface [192.168.0.1] is now up.
> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
> [192.168.0.1] is now up.
> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
> network interface [192.168.1.1] is now up.
> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
> [192.168.1.1] is now up.
> Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ] A
> new membership (192.168.0.1:601760) was formed. Members
> Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (
> 192.168.0.1:601760) was formed. Members
> Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ] A
> new membership (192.168.0.1:601764) was formed. Members joined: 2
> Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (
> 192.168.0.1:601764) was formed. Members joined: 2
> Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error   [TOTEM ]
> Marking ringid 1 interface 192.168.1.1 FAULTY
> Aug 22 14:44:34 node1 corosync[1439]:   [TOTEM ] Marking ringid 1 interface
> 192.168.1.1 FAULTY
> 
> 
> If I execute corosync-cfgtool, it clears the faulty error, but after a few
> seconds the ring returns to FAULTY.
> The only thing that resolves the problem is restarting the service with
> service corosync restart.
> 
> Here are some of my configuration settings on node 1 (I have already tried
> changing rrp_mode):
> 
> *- corosync.conf*
> 
> totem {
>         version: 2
>         cluster_name: node
>         token: 5000
>         token_retransmits_before_loss_const: 10
>         secauth: off
>         threads: 0
>         rrp_mode: passive
>         nodeid: 1
>         interface {
>                 ringnumber: 0
>                 bindnetaddr: 192.168.0.0
>                 #mcastaddr: 226.94.1.1
>                 mcastport: 5405
>                 broadcast: yes
>         }
>         interface {
>                 ringnumber: 1
>                 bindnetaddr: 192.168.1.0
>                 #mcastaddr: 226.94.1.2
>                 mcastport: 5407
>                 broadcast: yes
>         }
> }
> 
> logging {
>         fileline: off
>         to_stderr: yes
>         to_syslog: yes
>         to_logfile: yes
>         logfile: /var/log/corosync/corosync.log
>         debug: off
>         timestamp: on
>         logger_subsys {
>                 subsys: AMF
>                 debug: off
>         }
> }
> 
> amf {
>         mode: disabled
> }
> 
> quorum {
>         provider: corosync_votequorum
>         expected_votes: 2
> }
> 
> nodelist {
>         node {
>                 nodeid: 1
>                 ring0_addr: 192.168.0.1
>                 ring1_addr: 192.168.1.1
>         }
> 
>         node {
>                 nodeid: 2
>                 ring0_addr: 192.168.0.2
>                 ring1_addr: 192.168.1.2
>         }
> }
> 

My understanding so far was that nodelist is used with udpu transport
only. You may try without nodelist or with transport: udpu to see if it
makes a difference.
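
As a rough, untested sketch of the second variant: the totem section below
keeps your ring addresses and ports and only adds transport: udpu. I am
assuming the broadcast: yes lines should be dropped when switching to
unicast, so treat that part as a guess and check corosync.conf(5) for your
version.

```
totem {
        version: 2
        cluster_name: node
        token: 5000
        token_retransmits_before_loss_const: 10
        rrp_mode: passive
        # Unicast UDP; peers are taken from the nodelist section.
        transport: udpu
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.0.0
                mcastport: 5405
                # broadcast: yes removed (assumption, see above)
        }
        interface {
                ringnumber: 1
                bindnetaddr: 192.168.1.0
                mcastport: 5407
                # broadcast: yes removed (assumption, see above)
        }
}
# The nodelist section stays exactly as in your current configuration.
```

The same change would have to be made on both nodes, followed by a corosync
restart on each.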

> aisexec {
>         user: root
>         group: root
> }
> 
> service {
>         name: pacemaker
>         ver: 1
> }
> 
> 
> 
> *- /etc/hosts*
> 
> 
> 127.0.0.1       localhost
> 10.4.172.5      node1.upc.edu node1
> 10.4.172.6      node2.upc.edu node2
> 
> 
> Thank you for your help in advance!
> 
> 
> 