[Pacemaker] Backup ring is marked faulty

Tue Aug 2 20:45:46 EDT 2011

Which version of corosync?

On 08/02/2011 07:35 AM, Sebastian Kaps wrote:
> Hi,
> 
> we're running a two-node cluster with redundant rings.
> Ring 0 is a 10 Gbit/s direct connection; ring 1 consists of two 1 Gbit/s
> interfaces that are bonded in active-backup mode and routed through two
> independent switches for each node. The ring 1 network is our "normal"
> 1G LAN and should only be used if the direct 10G connection fails.
> I often (once a day on average, I'd guess) see that ring 1 (and only
> that one) is marked as FAULTY without any obvious reason.
> 
> Aug  2 08:56:15 node02 corosync[5752]:  [TOTEM ] Retransmit List: c76 c7a c7c c7e c80 c82 c84
> Aug  2 08:56:15 node02 corosync[5752]:  [TOTEM ] Retransmit List: c82
> Aug  2 08:56:15 node02 corosync[5752]:  [TOTEM ] Marking seqid 568416 ringid 1 interface x.y.z.1 FAULTY - administrative intervention required.
> 
> Whenever I see this, I check whether the other node's address can be
> pinged (I have never seen any connectivity problems there), then
> re-enable the ring with "corosync-cfgtool -r", and everything looks OK
> for a while (i.e. hours or days).
> 
> How could I find out why this happens?
> What do these "Retransmit List" or seqid (sequence id, I assume?)
> values tell me?
> Is it safe to re-enable the second ring when the partner node can be
> pinged successfully?
> 
> The totem section of our config looks like this:
> 
> totem {
>        rrp_mode:       passive
>        join:   60
>        max_messages:   20
>        vsftype:        none
>        consensus:      10000
>        secauth:        on
>        token_retransmits_before_loss_const:    10
>        threads:        16
>        token:  10000
>        version:        2
>        interface {
>                bindnetaddr:    192.168.1.0
>                mcastaddr:      239.250.1.1
>                mcastport:      5405
>                ringnumber:     0
>        }
>        interface {
>                bindnetaddr:    x.y.z.0
>                mcastaddr:      239.250.1.2
>                mcastport:      5415
>                ringnumber:     1
>        }
>        clear_node_high_bit:    yes
> }
>
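
One way to start narrowing down why the ring gets marked FAULTY is to collect the exact times of every such event from syslog and compare them with switch logs, bonding failover messages, or load spikes. Below is a minimal Python sketch of that idea. The regular expression is modelled only on the single FAULTY line quoted above, so other corosync versions may word the message differently, and the names `RingFault`, `FAULT_RE` and `find_ring_faults` are made up for this illustration.

```python
import re
from typing import Iterable, List, NamedTuple


class RingFault(NamedTuple):
    timestamp: str   # syslog timestamp exactly as written in the log
    host: str        # node that logged the event
    seqid: int       # sequence id reported in the message
    ring: int        # ring number that was marked FAULTY
    interface: str   # interface address reported in the message


# Assumption: the message layout matches the log line quoted in the post.
FAULT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+corosync\[\d+\]:\s+"
    r"\[TOTEM\s*\]\s+Marking seqid (?P<seqid>\d+) ringid (?P<ring>\d+) "
    r"interface (?P<iface>\S+) FAULTY"
)


def find_ring_faults(lines: Iterable[str]) -> List[RingFault]:
    """Return one RingFault per 'Marking ... FAULTY' line found in `lines`."""
    faults = []
    for line in lines:
        m = FAULT_RE.search(line)
        if m:
            faults.append(
                RingFault(
                    timestamp=m.group("ts"),
                    host=m.group("host"),
                    seqid=int(m.group("seqid")),
                    ring=int(m.group("ring")),
                    interface=m.group("iface"),
                )
            )
    return faults
```

The function accepts any iterable of lines, so it can be fed an open log file (wherever syslog writes on your system) or a list of strings. It only reports when the faults happened; it does not explain them.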