[Pacemaker] Backup ring is marked faulty

Sebastian Kaps sebastian.kaps at imail.de
Tue Aug 2 10:35:43 EDT 2011


 We're running a two-node cluster with redundant rings.
 Ring 0 is a direct 10 Gbit connection; ring 1 consists of two 1 Gbit
 interfaces that are bonded in active-backup mode and routed through
 two independent switches for each node. The ring 1 network is our
 "normal" 1G LAN and should only be used if the direct 10G connection
 fails.
 I often (about once a day on average, I'd guess) see that ring 1 (and
 only that one) is marked as FAULTY for no obvious reason.

 Aug  2 08:56:15 node02 corosync[5752]:  [TOTEM ] Retransmit List: c76 c7a c7c c7e c80 c82 c84
 Aug  2 08:56:15 node02 corosync[5752]:  [TOTEM ] Retransmit List: c82
 Aug  2 08:56:15 node02 corosync[5752]:  [TOTEM ] Marking seqid 568416 ringid 1 interface x.y.z.1 FAULTY - administrative intervention
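
 To see whether these events cluster at particular times (which might
 point at a backup job, a switch event, or similar), one could scan the
 syslog for the FAULTY markings. The following is only a minimal sketch
 in Python, assuming the log line format shown in the excerpt above; the
 function names and the log path mentioned in the comment are
 illustrative and not part of corosync.

```python
import re
from collections import Counter

# Matches the "Marking ... FAULTY" line format shown in the excerpt above.
# Assumption: standard syslog prefix "Mon DD HH:MM:SS host corosync[pid]:".
FAULTY_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d+)\s+(?P<time>[\d:]+)\s+(?P<host>\S+)\s+"
    r"corosync\[\d+\]:\s+\[TOTEM \] Marking seqid (?P<seqid>\d+) "
    r"ringid (?P<ringid>\d+) interface (?P<iface>\S+) FAULTY"
)


def find_faulty_events(lines):
    """Return one dict per FAULTY marking found in the given log lines."""
    events = []
    for line in lines:
        m = FAULTY_RE.match(line.strip())
        if m:
            event = m.groupdict()
            event["seqid"] = int(event["seqid"])
            event["ringid"] = int(event["ringid"])
            events.append(event)
    return events


def faults_per_hour(events):
    """Count FAULTY markings by hour of day to reveal recurring times."""
    return Counter(e["time"].split(":")[0] for e in events)


# Hypothetical usage; the log path depends on the distribution:
#   with open("/var/log/messages") as f:
#       print(faults_per_hour(find_faulty_events(f)))
```

 If the counts pile up in one or two hours of the day, that would be a
 hint to look at what else is happening on the 1G LAN at those times.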

 Whenever I see this, I check whether the other node's address can be
 pinged (I have never seen any connectivity problems there), then
 re-enable the ring with "corosync-cfgtool -r", and everything looks OK
 for a while (i.e. hours or days).
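
 The manual routine described above (ping the peer, then reset the
 ring) could be sketched as follows. This is a hedged illustration, not
 a recommended automation: the function name and the `run` parameter
 are hypothetical, a Linux-style `ping -c` is assumed, and a successful
 ICMP ping only shows that the peer is reachable, not that totem
 traffic on the ring's port is getting through.

```python
import subprocess


def reenable_ring_if_reachable(peer_addr, run=subprocess.run):
    """Ping the peer; if it answers, reset the redundant ring state.

    Returns True only if "corosync-cfgtool -r" was invoked and exited 0.
    The `run` parameter exists so the commands can be stubbed in tests.
    """
    # -c 3: send three echo requests (Linux iputils ping assumed).
    ping = run(["ping", "-c", "3", peer_addr])
    if ping.returncode != 0:
        return False  # peer unreachable: leave the ring marked FAULTY

    # -r resets the redundant ring state cluster-wide after a fault.
    reset = run(["corosync-cfgtool", "-r"])
    return reset.returncode == 0
```

 Afterwards, "corosync-cfgtool -s" shows the current ring status on the
 local node, which confirms whether the ring is active again.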

 How can I find out why this happens?
 What do these "Retransmit List" and seqid (sequence ID, I assume?)
 values tell me?
 Is it safe to re-enable the second ring when the partner node can be
 pinged successfully?

 The totem section of our config looks like this:

 totem {
         rrp_mode:       passive
         join:   60
         max_messages:   20
         vsftype:        none
         consensus:      10000
         secauth:        on
         token_retransmits_before_loss_const:    10
         threads:        16
         token:  10000
         version:        2
         interface {
                 mcastport:      5405
                 ringnumber:     0
         }
         interface {
                 bindnetaddr:    x.y.z.0
                 mcastport:      5415
                 ringnumber:     1
         }
         clear_node_high_bit:    yes
 }


