[Pacemaker] Backup ring is marked faulty
Sebastian Kaps
sebastian.kaps at imail.de
Tue Aug 2 14:35:43 UTC 2011
Hi,
we're running a two-node cluster with redundant rings. Ring 0 is a 10 GbE direct
connection; ring 1 consists of two 1 GbE interfaces that are bonded in active-backup
mode and routed through two independent switches for each node. The ring 1 network
is our "normal" 1G LAN and should only be used if the direct 10G connection fails.
I often (once a day on average, I'd guess) see that ring 1 (and only that one) is
marked as FAULTY without any obvious reason.
Aug  2 08:56:15 node02 corosync[5752]: [TOTEM ] Retransmit List: c76 c7a c7c c7e c80 c82 c84
Aug  2 08:56:15 node02 corosync[5752]: [TOTEM ] Retransmit List: c82
Aug  2 08:56:15 node02 corosync[5752]: [TOTEM ] Marking seqid 568416 ringid 1 interface x.y.z.1 FAULTY - administrative intervention required.
Whenever I see this, I check whether the other node's address can be pinged (I have
never seen any connectivity problems there), then re-enable the ring with
"corosync-cfgtool -r", and everything looks OK for a while (i.e. hours or days).
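For reference, what I run each time is roughly the following (the corosync-cfgtool
options are from its man page; x.y.z.2 is just a placeholder for the partner's
ring 1 address):

# show which rings corosync currently considers active/faulty on this node
corosync-cfgtool -s
# verify basic connectivity to the partner node over the ring 1 network
ping -c 3 x.y.z.2
# reset the redundant ring state so the FAULTY ring is re-enabled
corosync-cfgtool -r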
How could I find out why this happens?
What do these "Retransmit List" and seqid (sequence ID, I assume?) values tell me?
Is it safe to re-enable the second ring when the partner node can be pinged successfully?
The totem section of our config looks like this:
totem {
    rrp_mode: passive
    join: 60
    max_messages: 20
    vsftype: none
    consensus: 10000
    secauth: on
    token_retransmits_before_loss_const: 10
    threads: 16
    token: 10000
    version: 2
    interface {
        bindnetaddr: 192.168.1.0
        mcastaddr: 239.250.1.1
        mcastport: 5405
        ringnumber: 0
    }
    interface {
        bindnetaddr: x.y.z.0
        mcastaddr: 239.250.1.2
        mcastport: 5415
        ringnumber: 1
    }
    clear_node_high_bit: yes
}
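If the cause turns out to be occasional packet loss on the 1G LAN rather than a real
link failure, I assume we could also make RRP more tolerant before it marks the ring
faulty. corosync.conf(5) documents rrp_problem_count_threshold and
rrp_problem_count_timeout for this; a sketch of what that might look like (the values
below are only guesses, not what we currently run):

totem {
    ...
    rrp_problem_count_threshold: 20
    rrp_problem_count_timeout: 2000
}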
--
Sebastian