[ClusterLabs] Redundant ring not recovering after node is back

Fri Aug 24 10:21:07 EDT 2018

On Fri, 2018-08-24 at 08:21 +0200, Jan Friesse wrote:
> > I tried to install corosync 3.x and it works pretty well.
> 
> Cool
> 
> > But when I install pacemaker, it installs the previous version of
> > corosync as a dependency and breaks the whole setup.
> > Any suggestions?
> 
> I can see at least the following "solutions":
> - make a proper Debian package
> - install corosync 3 to /usr/local
> - (ugly) install the packaged corosync and then overwrite it with
> corosync 3 built from source
> 
> Regards,
>    Honza

If you're compiling corosync 3, you may want to consider compiling
pacemaker 2.0.0 as well (or even pacemaker master branch, which has
extra bug fixes and should be stable at the moment).

If you're not familiar with checkinstall, it's a simple way to build
.deb packages from any "make install", so you only have to compile on
one host.
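
As a rough sketch of that workflow (untested against these exact source
trees): the helper names run and build_deb and the DRY_RUN switch are
made up for illustration, the autogen.sh/configure/make sequence is an
assumption about the usual autotools build, and the package name and
version are placeholders you would pick yourself.

```shell
#!/bin/sh
# Sketch: turn a source tree's "make install" into a .deb with checkinstall.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
    # Hypothetical helper: echo the command in dry-run mode, else execute it.
    if [ -n "${DRY_RUN:-}" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

build_deb() {
    # $1 = source directory, $2 = package name, $3 = package version
    # (all three are placeholders supplied by the caller)
    (
        # Only enter the source tree when commands are really executed.
        [ -n "${DRY_RUN:-}" ] || cd "$1" || exit 1
        run ./autogen.sh &&
        run ./configure &&
        run make &&
        # -D asks for a Debian package; checkinstall runs "make install" itself.
        run sudo checkinstall -D --pkgname="$2" --pkgversion="$3" make install
    )
}
```

For example, "DRY_RUN=1 build_deb ~/src/corosync corosync 2.99.3" only
prints the four commands. Without DRY_RUN it builds the package on the
compile host, and the resulting .deb can then be copied to the other
node and installed there with dpkg -i. For corosync 3, knet would have
to be built and packaged the same way first.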

You could also get in touch with the Debian HA team
( https://wiki.debian.org/Debian-HA ) to see what their plans are for
the new versions and/or get tips on building.

> > 
> > 2018-08-23 9:32 GMT+02:00 Jan Friesse <jfriesse at redhat.com>:
> > 
> > > David,
> > > 
> > > > BTW, where can I download Corosync 3.x?
> > > > I've only seen Corosync 2.99.3 Alpha4 at
> > > > http://corosync.github.io/corosync/
> > > > 
> > > 
> > > Yes, that's Alpha 4 of Corosync 3.
> > > 
> > > 
> > > 
> > > 
> > > > 2018-08-23 9:11 GMT+02:00 David Tolosa <david.tolosa at upcnet.es>:
> > > > 
> > > > > I'm currently using an Ubuntu 18.04 server configuration with
> > > > > netplan.
> > > > > 
> > > > > Here you have my current YAML configuration:
> > > > > 
> > > > > # This file describes the network interfaces available on
> > > > > # your system.
> > > > > # For more information, see netplan(5).
> > > > > network:
> > > > >     version: 2
> > > > >     renderer: networkd
> > > > >     ethernets:
> > > > >       eno1:
> > > > >         addresses: [192.168.0.1/24]
> > > > >       enp4s0f0:
> > > > >         addresses: [192.168.1.1/24]
> > > > >       enp5s0f0:
> > > > >         {}
> > > > >     vlans:
> > > > >       vlan.XXX:
> > > > >         id: XXX
> > > > >         link: enp5s0f0
> > > > >         addresses: [ 10.1.128.5/29 ]
> > > > >         gateway4: 10.1.128.1
> > > > >         nameservers:
> > > > >           addresses: [ 8.8.8.8, 8.8.4.4 ]
> > > > >           search: [ foo.com, bar.com ]
> > > > >       vlan.YYY:
> > > > >         id: YYY
> > > > >         link: enp5s0f0
> > > > >         addresses: [ 10.1.128.5/29 ]
> > > > > 
> > > > > 
> > > > > So eno1 and enp4s0f0 are the two Ethernet ports connected
> > > > > directly to node2 with crossover cables.
> > > > > The enp5s0f0 port is used to reach outside networks/services
> > > > > through the VLANs defined in the same file.
> > > > > 
> > > > > In short, I'm using systemd-networkd, the default Ubuntu 18
> > > > > server service, to manage networks.
> > > > > 
> > > 
> > > OK, so systemd-networkd really is doing ifdown, and somebody is
> > > actually trying to fix it and get the fix merged upstream (sadly
> > > without much luck so far :( )
> > > 
> > > https://github.com/systemd/systemd/pull/7403
> > > 
> > > 
> > > > > I'm not finding any NetworkManager-config-server package in
> > > > > my repository either.
> > > > > 
> > > 
> > > I'm not sure what it's called in Debian-based distributions, but
> > > it's just one small file in /etc, so you can extract it from the
> > > RPM.
> > > 
> > > > > So the only solution I have left, I suppose, is to test
> > > > > corosync 3.x and see if it handles RRP better.
> > > > > 
> > > 
> > > You may also want to try either a completely static network
> > > configuration or NetworkManager + NetworkManager-config-server.
> > > 
> > > 
> > > Corosync 3.x with knet will work for sure, but be prepared for
> > > quite a long compile path, because you first have to compile knet
> > > and then corosync. What may help you a bit is that we have an
> > > Ubuntu 18.04 builder in our Jenkins, so it should be possible to
> > > follow the build logs (corosync build log:
> > > https://ci.kronosnet.org/view/corosync/job/corosync-build-all-voting/lastBuild/corosync-build-all-voting=ubuntu-18-04-lts-x86-64/consoleText
> > > knet build log:
> > > https://ci.kronosnet.org/view/knet/job/knet-build-all-voting/lastBuild/knet-build-all-voting=ubuntu-18-04-lts-x86-64/consoleText
> > > ).
> > > 
> > > Also please consult
> > > http://people.redhat.com/ccaulfie/docs/KnetCorosync.pdf
> > > about changes in corosync configuration.
> > > 
> > > Regards,
> > >    Honza
> > > 
> > > 
> > > > > Thank you for your quick response!
> > > > > 
> > > > > 2018-08-23 8:40 GMT+02:00 Jan Friesse <jfriesse at redhat.com>:
> > > > > 
> > > > > > David,
> > > > > > 
> > > > > > > Hello,
> > > > > > 
> > > > > > > I'm going crazy over this problem, which I hope to
> > > > > > > resolve here with your help, guys:
> > > > > > > 
> > > > > > > I have 2 nodes with Corosync redundant ring feature.
> > > > > > > 
> > > > > > > Each node has 2 similarly connected/configured NICs.
> > > > > > > Both nodes are connected to each other by two crossover
> > > > > > > cables.
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > I believe this is the root of the problem. Are you using
> > > > > > NetworkManager? If so, have you installed
> > > > > > NetworkManager-config-server? If not, please install it and
> > > > > > test again.
> > > > > > 
> > > > > > 
> > > > > > > I configured both nodes with rrp mode passive. Everything
> > > > > > > works well at this point, but when I shut down one node
> > > > > > > to test failover and this node comes back online,
> > > > > > > corosync marks the interface as FAULTY and RRP fails to
> > > > > > > recover the initial state:
> > > > > > > 
> > > > > > 
> > > > > > I believe it's because, with a crossover-cable configuration,
> > > > > > when the other side is shut down NetworkManager detects it
> > > > > > and does an ifdown of the interface, and corosync is unable
> > > > > > to handle ifdown properly. Ifdown is bad with a single ring,
> > > > > > but it's a real killer with RRP (127.0.0.1 poisons every
> > > > > > node in the cluster).
> > > > > > 
> > > > > > > 
> > > > > > > 1. Initial scenario:
> > > > > > > 
> > > > > > > # corosync-cfgtool -s
> > > > > > > Printing ring status.
> > > > > > > Local node ID 1
> > > > > > > RING ID 0
> > > > > > >            id      = 192.168.0.1
> > > > > > >            status  = ring 0 active with no faults
> > > > > > > RING ID 1
> > > > > > >            id      = 192.168.1.1
> > > > > > >            status  = ring 1 active with no faults
> > > > > > > 
> > > > > > > 
> > > > > > > 2. When I shut down node 2, everything continues with no
> > > > > > > faults. Sometimes the ring IDs bind to 127.0.0.1 and then
> > > > > > > bind back to their respective heartbeat IPs.
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > Again, result of ifdown.
> > > > > > 
> > > > > > 
> > > > > > > 3. When node 2 is back online:
> > > > > > > 
> > > > > > > # corosync-cfgtool -s
> > > > > > > Printing ring status.
> > > > > > > Local node ID 1
> > > > > > > RING ID 0
> > > > > > >            id      = 192.168.0.1
> > > > > > >            status  = ring 0 active with no faults
> > > > > > > RING ID 1
> > > > > > >            id      = 192.168.1.1
> > > > > > >            status  = Marking ringid 1 interface 192.168.1.1 FAULTY
> > > > > > > 
> > > > > > > 
> > > > > > > # service corosync status
> > > > > > > ● corosync.service - Corosync Cluster Engine
> > > > > > >       Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
> > > > > > >       Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s ago
> > > > > > >         Docs: man:corosync
> > > > > > >               man:corosync.conf
> > > > > > >               man:corosync_overview
> > > > > > >     Main PID: 1439 (corosync)
> > > > > > >        Tasks: 2 (limit: 4915)
> > > > > > >       CGroup: /system.slice/corosync.service
> > > > > > >               └─1439 /usr/sbin/corosync -f
> > > > > > > 
> > > > > > > 
> > > > > > > Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The network interface [192.168.0.1] is now up.
> > > > > > > Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface [192.168.0.1] is now up.
> > > > > > > Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The network interface [192.168.1.1] is now up.
> > > > > > > Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface [192.168.1.1] is now up.
> > > > > > > Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ] A new membership (192.168.0.1:601760) was formed. Members
> > > > > > > Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (192.168.0.1:601760) was formed. Members
> > > > > > > Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ] A new membership (192.168.0.1:601764) was formed. Members joined: 2
> > > > > > > Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (192.168.0.1:601764) was formed. Members joined: 2
> > > > > > > Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error   [TOTEM ] Marking ringid 1 interface 192.168.1.1 FAULTY
> > > > > > > Aug 22 14:44:34 node1 corosync[1439]:   [TOTEM ] Marking ringid 1 interface 192.168.1.1 FAULTY
> > > > > > > 
> > > > > > > 
> > > > > > > If I run corosync-cfgtool, it clears the faulty state,
> > > > > > > but after a few seconds the ring goes back to FAULTY.
> > > > > > > The only thing that resolves the problem is restarting
> > > > > > > the service with service corosync restart.
> > > > > > > 
> > > > > > > Here are some of my configuration settings on node 1 (I
> > > > > > > have already tried changing rrp_mode):
> > > > > > > 
> > > > > > > *- corosync.conf*
> > > > > > > 
> > > > > > > 
> > > > > > > totem {
> > > > > > >            version: 2
> > > > > > >            cluster_name: node
> > > > > > >            token: 5000
> > > > > > >            token_retransmits_before_loss_const: 10
> > > > > > >            secauth: off
> > > > > > >            threads: 0
> > > > > > >            rrp_mode: passive
> > > > > > >            nodeid: 1
> > > > > > >            interface {
> > > > > > >                    ringnumber: 0
> > > > > > >                    bindnetaddr: 192.168.0.0
> > > > > > >                    #mcastaddr: 226.94.1.1
> > > > > > >                    mcastport: 5405
> > > > > > >                    broadcast: yes
> > > > > > >            }
> > > > > > >            interface {
> > > > > > >                    ringnumber: 1
> > > > > > >                    bindnetaddr: 192.168.1.0
> > > > > > >                    #mcastaddr: 226.94.1.2
> > > > > > >                    mcastport: 5407
> > > > > > >                    broadcast: yes
> > > > > > >            }
> > > > > > > }
> > > > > > > 
> > > > > > > logging {
> > > > > > >            fileline: off
> > > > > > >            to_stderr: yes
> > > > > > >            to_syslog: yes
> > > > > > >            to_logfile: yes
> > > > > > >            logfile: /var/log/corosync/corosync.log
> > > > > > >            debug: off
> > > > > > >            timestamp: on
> > > > > > >            logger_subsys {
> > > > > > >                    subsys: AMF
> > > > > > >                    debug: off
> > > > > > >            }
> > > > > > > }
> > > > > > > 
> > > > > > > amf {
> > > > > > >            mode: disabled
> > > > > > > }
> > > > > > > 
> > > > > > > quorum {
> > > > > > >            provider: corosync_votequorum
> > > > > > >            expected_votes: 2
> > > > > > > }
> > > > > > > 
> > > > > > > nodelist {
> > > > > > >            node {
> > > > > > >                    nodeid: 1
> > > > > > >                    ring0_addr: 192.168.0.1
> > > > > > >                    ring1_addr: 192.168.1.1
> > > > > > >            }
> > > > > > > 
> > > > > > >            node {
> > > > > > >                    nodeid: 2
> > > > > > >                    ring0_addr: 192.168.0.2
> > > > > > >                    ring1_addr: 192.168.1.2
> > > > > > >            }
> > > > > > > }
> > > > > > > 
> > > > > > > aisexec {
> > > > > > >            user: root
> > > > > > >            group: root
> > > > > > > }
> > > > > > > 
> > > > > > > service {
> > > > > > >            name: pacemaker
> > > > > > >            ver: 1
> > > > > > > }
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > *- /etc/hosts*
> > > > > > > 
> > > > > > > 
> > > > > > > 127.0.0.1       localhost
> > > > > > > 10.4.172.5      node1.upc.edu node1
> > > > > > > 10.4.172.6      node2.upc.edu node2
> > > > > > > 
> > > > > > > 
> > > > > > So the machines have 3 NICs? 2 for corosync/cluster traffic and
> > > > > > one for regular traffic/services/outside world?
> > > > > > 
> > > > > > 
> > > > > > > Thank you for your help in advance!
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > To conclude:
> > > > > > - If you are using NetworkManager, try installing
> > > > > > NetworkManager-config-server; it will probably help.
> > > > > > - If you are brave enough, try corosync 3.x (the current
> > > > > > Alpha4 is pretty stable - actually some other projects gain
> > > > > > this stability with SP1 :) ), which has no RRP but uses knet
> > > > > > to support redundant links (up to 8 links can be configured)
> > > > > > and doesn't have problems with ifdown.
> > > > > > 
> > > > > > Honza
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > _______________________________________________
> > > > > > > Users mailing list: Users at clusterlabs.org
> > > > > > > https://lists.clusterlabs.org/mailman/listinfo/users
> > > > > > > 
> > > > > > > Project Home: http://www.clusterlabs.org
> > > > > > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > > > > > Bugs: http://bugs.clusterlabs.org
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > 
> > > > > --
> > > > > *David Tolosa Martínez*
> > > > > Customer Support & Infrastructure
> > > > > UPCnet - Edifici Vèrtex
> > > > > Plaça d'Eusebi Güell, 6, 08034 Barcelona
> > > > > Tel: 934054555
> > > > > 
> > > > > <https://www.upcnet.es>
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > > 
> > 
> > 
> 
-- 
Ken Gaillot <kgaillot at redhat.com>