[ClusterLabs] Redundant ring not recovering after node is back
Andrei Borzenkov
arvidjaar at gmail.com
Thu Aug 23 00:13:32 EDT 2018
22.08.2018 15:53, David Tolosa wrote:
> Hello,
> I'm going crazy over this problem, which I hope to resolve here with
> your help:
>
> I have 2 nodes with Corosync redundant ring feature.
>
> Each node has two similarly connected/configured NICs. The nodes are
> connected to each other by two crossover cables.
>
> I configured both nodes with rrp_mode passive. Everything works well
> at this point, but when I shut down one node to test failover and that
> node comes back online, corosync marks the interface as FAULTY and RRP
> fails to recover to the initial state:
>
> 1. Initial scenario:
>
> # corosync-cfgtool -s
> Printing ring status.
> Local node ID 1
> RING ID 0
> id = 192.168.0.1
> status = ring 0 active with no faults
> RING ID 1
> id = 192.168.1.1
> status = ring 1 active with no faults
>
>
> 2. When I shut down node 2, everything continues with no faults.
> Sometimes the ring IDs bind to 127.0.0.1 and then bind back to their
> respective heartbeat IPs.
>
> 3. When node 2 is back online:
>
> # corosync-cfgtool -s
> Printing ring status.
> Local node ID 1
> RING ID 0
> id = 192.168.0.1
> status = ring 0 active with no faults
> RING ID 1
> id = 192.168.1.1
> status = Marking ringid 1 interface 192.168.1.1 FAULTY
>
>
> # service corosync status
> ● corosync.service - Corosync Cluster Engine
> Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor
> preset: enabled)
> Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s ago
> Docs: man:corosync
> man:corosync.conf
> man:corosync_overview
> Main PID: 1439 (corosync)
> Tasks: 2 (limit: 4915)
> CGroup: /system.slice/corosync.service
> └─1439 /usr/sbin/corosync -f
>
>
> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice [TOTEM ] The
> network interface [192.168.0.1] is now up.
> Aug 22 14:44:11 node1 corosync[1439]: [TOTEM ] The network interface
> [192.168.0.1] is now up.
> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice [TOTEM ] The
> network interface [192.168.1.1] is now up.
> Aug 22 14:44:11 node1 corosync[1439]: [TOTEM ] The network interface
> [192.168.1.1] is now up.
> Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice [TOTEM ] A
> new membership (192.168.0.1:601760) was formed. Members
> Aug 22 14:44:26 node1 corosync[1439]: [TOTEM ] A new membership (
> 192.168.0.1:601760) was formed. Members
> Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice [TOTEM ] A
> new membership (192.168.0.1:601764) was formed. Members joined: 2
> Aug 22 14:44:32 node1 corosync[1439]: [TOTEM ] A new membership (
> 192.168.0.1:601764) was formed. Members joined: 2
> Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error [TOTEM ]
> Marking ringid 1 interface 192.168.1.1 FAULTY
> Aug 22 14:44:34 node1 corosync[1439]: [TOTEM ] Marking ringid 1 interface
> 192.168.1.1 FAULTY
>
>
> If I execute corosync-cfgtool, it clears the faulty state, but after
> a few seconds the ring returns to FAULTY.
> The only thing that resolves the problem is restarting the service
> with service corosync restart.
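(For reference: clearing the faulty flag by hand is done with corosync-cfgtool's -r option in corosync 2.x; a sketch of the sequence presumably meant above — corosync will simply re-mark the ring FAULTY if the underlying network problem persists:)

```shell
# Re-enable rings that corosync has marked FAULTY (corosync 2.x, RRP).
# This only clears the fault flag; it does not fix the underlying link.
corosync-cfgtool -r

# Check ring status again afterwards.
corosync-cfgtool -s
```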
>
> Here are some of my configuration settings on node 1 (I have already
> tried changing rrp_mode):
>
> *- corosync.conf*
>
> totem {
> version: 2
> cluster_name: node
> token: 5000
> token_retransmits_before_loss_const: 10
> secauth: off
> threads: 0
> rrp_mode: passive
> nodeid: 1
> interface {
> ringnumber: 0
> bindnetaddr: 192.168.0.0
> #mcastaddr: 226.94.1.1
> mcastport: 5405
> broadcast: yes
> }
> interface {
> ringnumber: 1
> bindnetaddr: 192.168.1.0
> #mcastaddr: 226.94.1.2
> mcastport: 5407
> broadcast: yes
> }
> }
>
> logging {
> fileline: off
> to_stderr: yes
> to_syslog: yes
> to_logfile: yes
> logfile: /var/log/corosync/corosync.log
> debug: off
> timestamp: on
> logger_subsys {
> subsys: AMF
> debug: off
> }
> }
>
> amf {
> mode: disabled
> }
>
> quorum {
> provider: corosync_votequorum
> expected_votes: 2
> }
>
> nodelist {
> node {
> nodeid: 1
> ring0_addr: 192.168.0.1
> ring1_addr: 192.168.1.1
> }
>
> node {
> nodeid: 2
> ring0_addr: 192.168.0.2
> ring1_addr: 192.168.1.2
> }
> }
>
My understanding so far was that nodelist is used with the udpu
transport only. You could try without the nodelist, or with transport:
udpu, to see whether it makes a difference.
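(A minimal sketch of what the totem section might look like with unicast transport, reusing the networks from the posted config — an illustration of the suggestion, not a tested configuration; with udpu the ring0_addr/ring1_addr entries in the nodelist are what corosync actually uses to reach peers, and mcastaddr/broadcast become irrelevant:)

```
totem {
        version: 2
        cluster_name: node
        rrp_mode: passive
        transport: udpu
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.0.0
                mcastport: 5405
        }
        interface {
                ringnumber: 1
                bindnetaddr: 192.168.1.0
                mcastport: 5407
        }
}
```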
> aisexec {
> user: root
> group: root
> }
>
> service {
> name: pacemaker
> ver: 1
> }
>
>
>
> *- /etc/hosts*
>
>
> 127.0.0.1 localhost
> 10.4.172.5 node1.upc.edu node1
> 10.4.172.6 node2.upc.edu node2
>
>
> Thank you for your help in advance!
>
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>