[ClusterLabs] Redundant ring not recovering after node is back
David Tolosa
david.tolosa at upcnet.es
Thu Aug 23 07:11:51 UTC 2018
I'm currently using an Ubuntu 18.04 server configuration with netplan.
Here is my current YAML configuration:
# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      addresses: [192.168.0.1/24]
    enp4s0f0:
      addresses: [192.168.1.1/24]
    enp5s0f0: {}
  vlans:
    vlan.XXX:
      id: XXX
      link: enp5s0f0
      addresses: [ 10.1.128.5/29 ]
      gateway4: 10.1.128.1
      nameservers:
        addresses: [ 8.8.8.8, 8.8.4.4 ]
        search: [ foo.com, bar.com ]
    vlan.YYY:
      id: YYY
      link: enp5s0f0
      addresses: [ 10.1.128.5/29 ]
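
For reference, the file lives under /etc/netplan/ and gets applied with the
usual netplan commands, for example:

# netplan try       (apply the configuration, with automatic rollback)
# netplan apply     (apply the configuration permanently)
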
So, eno1 and enp4s0f0 are the two Ethernet ports connected directly to
node2 with crossover cables.
The enp5s0f0 port is used to reach the outside world/services through the
VLANs defined in the same file.
In short, I'm using systemd-networkd, the default network management
service on Ubuntu 18.04 server. I also can't find any
NetworkManager-config-server package in my repositories.
So the only solution I have left, I suppose, is to test corosync 3.x and
see if it handles ring redundancy better than RRP.
Thank you for your quick response!
2018-08-23 8:40 GMT+02:00 Jan Friesse <jfriesse at redhat.com>:
> David,
>
> Hello,
>> I'm getting crazy about this problem, which I hope to resolve here with
>> your help, guys:
>>
>> I have 2 nodes with Corosync redundant ring feature.
>>
>> Each node has 2 similarly connected/configured NICs. Both nodes are
>> connected to each other by two crossover cables.
>>
>
> I believe this is the root of the problem. Are you using NetworkManager? If
> so, have you installed NetworkManager-config-server? If not, please install
> it and test again.
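>
> (For reference, what that package does is essentially ship a snippet in
> NetworkManager's conf.d that stops NM from touching interfaces on carrier
> loss - roughly the following, though the exact file contents may differ
> per distribution:
>
>     [main]
>     no-auto-default=*
>     ignore-carrier=*
>
> typically installed as /usr/lib/NetworkManager/conf.d/00-server.conf.)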
>
>
>> I configured both nodes with rrp_mode: passive. Everything works well at
>> this point, but when I shut down one node to test failover, and this node
>> comes back online, corosync marks the interface as FAULTY and RRP
>>
>
> I believe this is caused by the crossover-cable setup: when the other side
> is shut down, NetworkManager detects the loss of link and does an ifdown of
> the interface, and corosync is unable to handle ifdown properly. An ifdown
> is bad with a single ring, but it's a real killer with RRP (127.0.0.1
> poisons every node in the cluster).
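>
> (A rough illustration of what that looks like: right after the ifdown,
> corosync-cfgtool -s on the affected node typically shows the ring rebound
> to the loopback address, something like
>
>     RING ID 1
>             id      = 127.0.0.1
>
> and that 127.0.0.1 then gets spread around the cluster.)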
>
>> fails to recover the initial state:
>>
>> 1. Initial scenario:
>>
>> # corosync-cfgtool -s
>> Printing ring status.
>> Local node ID 1
>> RING ID 0
>>         id      = 192.168.0.1
>>         status  = ring 0 active with no faults
>> RING ID 1
>>         id      = 192.168.1.1
>>         status  = ring 1 active with no faults
>>
>>
>> 2. When I shut down node 2, everything continues with no faults.
>> Sometimes the ring IDs rebind to 127.0.0.1 and then bind back to their
>> respective heartbeat IPs.
>>
>
> Again, result of ifdown.
>
>
>> 3. When node 2 is back online:
>>
>> # corosync-cfgtool -s
>> Printing ring status.
>> Local node ID 1
>> RING ID 0
>>         id      = 192.168.0.1
>>         status  = ring 0 active with no faults
>> RING ID 1
>>         id      = 192.168.1.1
>>         status  = Marking ringid 1 interface 192.168.1.1 FAULTY
>>
>>
>> # service corosync status
>> ● corosync.service - Corosync Cluster Engine
>>    Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
>>    Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s ago
>>      Docs: man:corosync
>>            man:corosync.conf
>>            man:corosync_overview
>>  Main PID: 1439 (corosync)
>>     Tasks: 2 (limit: 4915)
>>    CGroup: /system.slice/corosync.service
>>            └─1439 /usr/sbin/corosync -f
>>
>>
>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice [TOTEM ] The network interface [192.168.0.1] is now up.
>> Aug 22 14:44:11 node1 corosync[1439]: [TOTEM ] The network interface [192.168.0.1] is now up.
>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice [TOTEM ] The network interface [192.168.1.1] is now up.
>> Aug 22 14:44:11 node1 corosync[1439]: [TOTEM ] The network interface [192.168.1.1] is now up.
>> Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice [TOTEM ] A new membership (192.168.0.1:601760) was formed. Members
>> Aug 22 14:44:26 node1 corosync[1439]: [TOTEM ] A new membership (192.168.0.1:601760) was formed. Members
>> Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice [TOTEM ] A new membership (192.168.0.1:601764) was formed. Members joined: 2
>> Aug 22 14:44:32 node1 corosync[1439]: [TOTEM ] A new membership (192.168.0.1:601764) was formed. Members joined: 2
>> Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error [TOTEM ] Marking ringid 1 interface 192.168.1.1 FAULTY
>> Aug 22 14:44:34 node1 corosync[1439]: [TOTEM ] Marking ringid 1 interface 192.168.1.1 FAULTY
>>
>>
>> If I execute corosync-cfgtool, it clears the faulty error, but after a
>> few seconds the ring becomes FAULTY again.
>> The only thing that resolves the problem is to restart the service with
>> service corosync restart.
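>>
>> (Concretely, something like the following; -r is the corosync-cfgtool
>> option that resets redundant ring state after a fault:
>>
>>     # corosync-cfgtool -r       # clears the FAULTY state on the rings
>>     # service corosync restart  # the only thing that makes it stick
>> )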
>>
>> Here are some of my configuration settings on node 1 (I have already
>> tried changing rrp_mode):
>>
>> - corosync.conf
>>
>>
>> totem {
>>         version: 2
>>         cluster_name: node
>>         token: 5000
>>         token_retransmits_before_loss_const: 10
>>         secauth: off
>>         threads: 0
>>         rrp_mode: passive
>>         nodeid: 1
>>         interface {
>>                 ringnumber: 0
>>                 bindnetaddr: 192.168.0.0
>>                 #mcastaddr: 226.94.1.1
>>                 mcastport: 5405
>>                 broadcast: yes
>>         }
>>         interface {
>>                 ringnumber: 1
>>                 bindnetaddr: 192.168.1.0
>>                 #mcastaddr: 226.94.1.2
>>                 mcastport: 5407
>>                 broadcast: yes
>>         }
>> }
>>
>> logging {
>>         fileline: off
>>         to_stderr: yes
>>         to_syslog: yes
>>         to_logfile: yes
>>         logfile: /var/log/corosync/corosync.log
>>         debug: off
>>         timestamp: on
>>         logger_subsys {
>>                 subsys: AMF
>>                 debug: off
>>         }
>> }
>>
>> amf {
>>         mode: disabled
>> }
>>
>> quorum {
>>         provider: corosync_votequorum
>>         expected_votes: 2
>> }
>>
>> nodelist {
>>         node {
>>                 nodeid: 1
>>                 ring0_addr: 192.168.0.1
>>                 ring1_addr: 192.168.1.1
>>         }
>>
>>         node {
>>                 nodeid: 2
>>                 ring0_addr: 192.168.0.2
>>                 ring1_addr: 192.168.1.2
>>         }
>> }
>>
>> aisexec {
>>         user: root
>>         group: root
>> }
>>
>> service {
>>         name: pacemaker
>>         ver: 1
>> }
>>
>>
>>
>> - /etc/hosts
>>
>>
>> 127.0.0.1 localhost
>> 10.4.172.5 node1.upc.edu node1
>> 10.4.172.6 node2.upc.edu node2
>>
>>
> So the machines have 3 NICs? Two for corosync/cluster traffic and one for
> regular traffic/services/the outside world?
>
>
>> Thank you for your help in advance!
>>
>
> To conclude:
> - If you are using NetworkManager, try installing
> NetworkManager-config-server; it will probably help.
> - If you are brave enough, try corosync 3.x (the current Alpha4 is pretty
> stable - actually, some other projects only gain this stability with SP1
> :) ). It has no RRP but uses knet to support redundant links (up to 8
> links can be configured) and doesn't have problems with ifdown; a rough
> sketch of such a configuration follows below.
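>
> (As a minimal sketch, assuming the option names from the corosync 3.x man
> pages and reusing your addresses - adjust to your setup - the knet
> equivalent of the configuration above would look roughly like:
>
>     totem {
>         version: 2
>         cluster_name: node
>         transport: knet
>         # link_mode replaces rrp_mode
>         link_mode: passive
>     }
>
>     nodelist {
>         # ring0_addr..ring7_addr define the knet links (up to 8)
>         node {
>             nodeid: 1
>             ring0_addr: 192.168.0.1
>             ring1_addr: 192.168.1.1
>         }
>         node {
>             nodeid: 2
>             ring0_addr: 192.168.0.2
>             ring1_addr: 192.168.1.2
>         }
>     }
>
>     quorum {
>         provider: corosync_votequorum
>         expected_votes: 2
>     }
>
> No interface/broadcast sections are needed for the links themselves.)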
>
> Honza
>
>
>>
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>>
>
--
*David Tolosa Martínez*
Customer Support & Infrastructure
UPCnet - Edifici Vèrtex
Plaça d'Eusebi Güell, 6, 08034 Barcelona
Tel: 934054555
<https://www.upcnet.es>