[ClusterLabs] Corosync with passive rrp, udpu - Unable to reset after "Marking ringid 1 interface 127.0.0.1 FAULTY"

Thu Jun 16 14:51:55 UTC 2016

Hello everyone,

We have been running a 3-node Pacemaker (1.1.14) / Corosync (2.3.5) cluster
successfully for a couple of months, and we have now started seeing a faulty
ring with an unexpected 127.0.0.1 binding that we cannot reset via
"corosync-cfgtool -r".

We have seen this once before, and only restarting Corosync (and everything
else) on the node showing the unexpected 127.0.0.1 binding made the problem go
away. In production, however, we would obviously like to avoid this if possible.

So, given the following description, how can I troubleshoot this issue, and
does anybody have a good idea what might be happening here?

We run two passive RRP rings across different IP subnets via udpu and get the
following output (all IPs obfuscated). Please note the unexpected interface
binding 127.0.0.1 for host pg2.

If we reset via "corosync-cfgtool -r" on each node, ring ID 1 briefly shows
"no faults" but goes back to "FAULTY" seconds later.

Regards,
Martin Schlegel
_____________________________________

root@pg1:~# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
        id      = A.B.C1.5
        status  = ring 0 active with no faults
RING ID 1
        id      = D.E.F1.170
        status  = Marking ringid 1 interface D.E.F1.170 FAULTY

root@pg2:~# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
        id      = A.B.C2.88
        status  = ring 0 active with no faults
RING ID 1
        id      = 127.0.0.1
        status  = Marking ringid 1 interface 127.0.0.1 FAULTY

root@pg3:~# corosync-cfgtool -s
Printing ring status.
Local node ID 3
RING ID 0
        id      = A.B.C3.236
        status  = ring 0 active with no faults
RING ID 1
        id      = D.E.F3.112
        status  = Marking ringid 1 interface D.E.F3.112 FAULTY
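
To spot this condition without reading the output by eye, the status text above could be parsed as in the following minimal sketch. It assumes the "corosync-cfgtool -s" output format shown in this post; the function names (parse_ring_status, is_loopback, suspicious_rings) are hypothetical helpers and not part of Corosync.

```python
import ipaddress
import re


def parse_ring_status(text):
    """Parse `corosync-cfgtool -s` output into a list of ring dicts."""
    rings = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        match = re.match(r"RING ID (\d+)$", line)
        if match:
            # Start a new ring record; "id" and "status" lines follow it.
            current = {"ring": int(match.group(1)), "id": None, "status": None}
            rings.append(current)
            continue
        if current is None or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key in ("id", "status"):
            current[key] = value
    return rings


def is_loopback(addr):
    """True if addr is a loopback IP; obfuscated or invalid strings give False."""
    try:
        return ipaddress.ip_address(addr).is_loopback
    except ValueError:
        return False


def suspicious_rings(text):
    """Return (ring, id, faulty, loopback) for rings that are FAULTY or bound to loopback."""
    result = []
    for ring in parse_ring_status(text):
        faulty = "FAULTY" in (ring["status"] or "")
        loopback = is_loopback(ring["id"] or "")
        if faulty or loopback:
            result.append((ring["ring"], ring["id"], faulty, loopback))
    return result
```

Fed the pg2 output above, this would flag ring 1 as both FAULTY and bound to loopback, while pg1 and pg3 would be flagged as FAULTY only. That difference is what makes pg2 stand out.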

_____________________________________

/etc/corosync/corosync.conf from pg1 - other nodes use different subnets and
IPs, but are otherwise identical:
===========================================
quorum {
    provider: corosync_votequorum
    expected_votes: 3
}

totem {
        version: 2

        crypto_cipher: none
        crypto_hash: none

        rrp_mode: passive
        interface {
                ringnumber: 0
                bindnetaddr: A.B.C1.0
                mcastport: 5405
                ttl: 1
        }
        interface {
                ringnumber: 1
                bindnetaddr: D.E.F1.64
                mcastport: 5405
                ttl: 1
        }
        transport: udpu
}

nodelist {
        node {
                ring0_addr: pg1
                ring1_addr: pg1p
                nodeid: 1
        }
        node {
                ring0_addr: pg2
                ring1_addr: pg2p
                nodeid: 2
        }
        node {
                ring0_addr: pg3
                ring1_addr: pg3p
                nodeid: 3
        }
}

logging {
    to_syslog: yes
}

===========================================
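
Since the nodelist above uses host names (pg1p, pg2p, pg3p) rather than IP addresses, one check that may help narrow this down is how those names resolve on each node. The sketch below rests on an assumption, not a confirmed cause: that a ring address name resolving to a loopback address could be related to the 127.0.0.1 binding. The helpers nodelist_addrs and loopback_names are hypothetical, and the resolver is passed in as a parameter so the check can be tried without touching the real DNS or /etc/hosts.

```python
import ipaddress
import re
import socket


def nodelist_addrs(conf_text):
    """Return (key, name) pairs such as ('ring1_addr', 'pg2p') from corosync.conf text."""
    return re.findall(r"^\s*(ring\d+_addr)\s*:\s*(\S+)\s*$", conf_text, re.MULTILINE)


def loopback_names(conf_text, resolve=socket.gethostbyname):
    """Return (key, name, addr) for names that resolve to loopback or not at all.

    addr is None when the name could not be resolved.
    """
    flagged = []
    for key, name in nodelist_addrs(conf_text):
        try:
            addr = resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            flagged.append((key, name, None))
            continue
        if ipaddress.ip_address(addr).is_loopback:
            flagged.append((key, name, addr))
    return flagged
```

Running loopback_names(open("/etc/corosync/corosync.conf").read()) on each node would list any ringX_addr entry that resolves into 127.0.0.0/8 or fails to resolve. An empty list would rule this particular assumption out for that node.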