[Pacemaker] Backup ring is marked faulty
Sebastian Kaps
sebastian.kaps at imail.de
Tue Aug 2 14:35:43 UTC 2011
Hi,
we're running a two-node cluster with redundant rings. Ring 0 is a 10 GbE direct
connection; ring 1 consists of two 1 GbE interfaces that are bonded in active-backup
mode and routed through two independent switches for each node. The ring 1 network
is our "normal" 1G LAN and should only be used if the direct 10G connection fails.
I often (once a day on average, I'd guess) see that ring 1 (and only that one) is
marked as FAULTY without any obvious reason.
Aug  2 08:56:15 node02 corosync[5752]: [TOTEM ] Retransmit List: c76 c7a c7c c7e c80 c82 c84
Aug  2 08:56:15 node02 corosync[5752]: [TOTEM ] Retransmit List: c82
Aug  2 08:56:15 node02 corosync[5752]: [TOTEM ] Marking seqid 568416 ringid 1 interface x.y.z.1 FAULTY - administrative intervention required.
Whenever I see this, I check whether the other node's address can be pinged (I have
never seen any connectivity problems there), then re-enable the ring with
"corosync-cfgtool -r", and everything looks OK for a while (i.e. hours or days).
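For reference, what I run each time is roughly the following (the corosync-cfgtool
options are from its man page; x.y.z.2 is just a placeholder for the partner's
ring 1 address):

# show which rings corosync currently considers active/faulty on this node
corosync-cfgtool -s
# verify basic connectivity to the partner node over the ring 1 network
ping -c 3 x.y.z.2
# reset the redundant ring state so the FAULTY ring is re-enabled
corosync-cfgtool -r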
How could I find out why this happens?
What do these "Retransmit List" and seqid (sequence ID, I assume?) values tell me?
Is it safe to re-enable the second ring when the partner node can be pinged successfully?
The totem section of our config looks like this:
totem {
    rrp_mode: passive
    join: 60
    max_messages: 20
    vsftype: none
    consensus: 10000
    secauth: on
    token_retransmits_before_loss_const: 10
    threads: 16
    token: 10000
    version: 2
    interface {
        bindnetaddr: 192.168.1.0
        mcastaddr: 239.250.1.1
        mcastport: 5405
        ringnumber: 0
    }
    interface {
        bindnetaddr: x.y.z.0
        mcastaddr: 239.250.1.2
        mcastport: 5415
        ringnumber: 1
    }
    clear_node_high_bit: yes
}
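If the cause turns out to be occasional packet loss on the 1G LAN rather than a real
link failure, I assume we could also make RRP more tolerant before it marks the ring
faulty. corosync.conf(5) documents rrp_problem_count_threshold and
rrp_problem_count_timeout for this; a sketch of what that might look like (the values
below are only guesses, not what we currently run):

totem {
    ...
    rrp_problem_count_threshold: 20
    rrp_problem_count_timeout: 2000
}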
--
Sebastian