[Pacemaker] Backup ring is marked faulty

Thu Aug 4 12:18:46 EDT 2011

On 08/03/2011 11:31 PM, Tegtmeier.Martin wrote:
> Hello again,
> 
> in my case it is always the slower ring that fails (the 100 Mbit network). Does rrp_mode passive expect both rings to have the same speed?
> 
> Sebastian, can you confirm that the slower ring is also the one that fails in your environment?
> 
> Thanks,
>   -Martin
> 
> 

Martin,

I have never tested faster+slower networks in redundant ring configs.
We only recently added support for this feature in the corosync project,
which means we can start to tackle some of these issues going forward.

The protocol is designed to limit itself to the speed of the slowest ring;
perhaps this is not working as intended.
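
For reference, a two-ring passive setup like the one discussed in this
thread would look roughly like the corosync.conf fragment below. This is
only a sketch under assumptions: the bindnetaddr values are taken from the
ring networks quoted further down (172.20.16.0 and 10.2.2.0), while the
multicast addresses and ports are placeholders and not anyone's actual
configuration.

```
totem {
        version: 2
        rrp_mode: passive

        # ring 0: the faster network in the report below
        interface {
                ringnumber: 0
                bindnetaddr: 172.20.16.0
                # placeholder multicast address and port (assumption)
                mcastaddr: 239.255.1.1
                mcastport: 5405
        }

        # ring 1: the slower network in the report below
        interface {
                ringnumber: 1
                bindnetaddr: 10.2.2.0
                # placeholder multicast address and port (assumption)
                mcastaddr: 239.255.2.1
                mcastport: 5407
        }
}
```

Each ring gets its own interface block, and the two rings should not share
a multicast address/port pair.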

Regards
-steve

> -----Original Message-----
> From: Tegtmeier.Martin [mailto:Martin.Tegtmeier at realtech.com] 
> Sent: Wednesday, 3 August 2011 11:03
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Backup ring is marked faulty
> 
> Hello,
> 
> we have exactly the same issue! Same version of corosync (1.3.1), also running on SuSE Linux Enterprise Server 11 SP1 with HAE.
> 
> Aug 01 15:45:18 corosync [TOTEM ] Received ringid(172.20.16.2:308) seq 6a
> Aug 01 15:45:18 corosync [TOTEM ] Received ringid(172.20.16.2:308) seq 63
> Aug 01 15:45:18 corosync [TOTEM ] releasing messages up to and including 60
> Aug 01 15:45:18 corosync [TOTEM ] releasing messages up to and including 6d
> Aug 01 15:45:18 corosync [TOTEM ] Marking seqid 162 ringid 1 interface 10.2.2.6 FAULTY - administrative intervention required.
> 
> rksaph06:/var/log/cluster # corosync-cfgtool -s
> Printing ring status.
> Local node ID 101717164
> RING ID 0
>         id      = 172.20.16.6
>         status  = ring 0 active with no faults
> RING ID 1
>         id      = 10.2.2.6
>         status  = Marking seqid 162 ringid 1 interface 10.2.2.6 FAULTY - administrative intervention required.
> 
> rrp_mode is set to "passive".
> Ring 0 (172.20.16.0) supports 1 Gbit and ring 1 (10.2.2.0) supports 100 Mbit. There was no other network traffic on ring 1, only corosync (!)
> 
> After re-activating both rings with "corosync-cfgtool -r", the problem is reproducible by simply connecting a crm_gui and hitting "refresh" inside the GUI 3-5 times. After that, ring 1 (10.2.2.0) is marked as "faulty" again.
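
While reproducing this, it can help to check the ring state from a script
instead of reading the output by eye. Below is a minimal Python sketch that
parses text in the shape of the "corosync-cfgtool -s" output quoted above
and reports rings whose status contains FAULTY. The function names are made
up for illustration, and the parsing is based only on the output shown in
this thread, so other corosync versions may print something different.

```python
import re


def parse_ring_status(text):
    """Return (ring_id, address, status) tuples parsed from status text.

    Leading '>' quote markers and blank lines are tolerated so that text
    pasted from a mail works as well as raw command output.
    """
    rings = []
    ring_id = None
    address = None
    for raw in text.splitlines():
        line = raw.lstrip("> \t").strip()
        header = re.match(r"RING ID (\d+)$", line)
        if header:
            ring_id, address = int(header.group(1)), None
            continue
        field = re.match(r"(id|status)\s*=\s*(.*)$", line)
        if field and ring_id is not None:
            if field.group(1) == "id":
                address = field.group(2)
            else:
                rings.append((ring_id, address, field.group(2)))
    return rings


def faulty_rings(text):
    """Return only the rings whose status line mentions FAULTY."""
    return [ring for ring in parse_ring_status(text) if "FAULTY" in ring[2]]


# Sample copied from the output quoted in this thread.
SAMPLE = """\
Printing ring status.
Local node ID 101717164
RING ID 0
        id      = 172.20.16.6
        status  = ring 0 active with no faults
RING ID 1
        id      = 10.2.2.6
        status  = Marking seqid 162 ringid 1 interface 10.2.2.6 FAULTY - administrative intervention required.
"""

if __name__ == "__main__":
    for ring_id, address, status in faulty_rings(SAMPLE):
        print(f"ring {ring_id} ({address}) is faulty")
```

On a live node the same functions could be fed the captured output of
"corosync-cfgtool -s", for example via subprocess.run, instead of SAMPLE.
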
> 
> Thanks and best regards,
>   -Martin Tegtmeier
> 
> 
> 
> 
> -----Original Message-----
> From: Sebastian Kaps [mailto:sebastian.kaps at imail.de]
> Sent: Wed, 3 August 2011 08:53
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Backup ring is marked faulty
>  
>  Hi Steven!
> 
>  On Tue, 02 Aug 2011 17:45:46 -0700, Steven Dake wrote:
>> Which version of corosync?
> 
>  # corosync -v
>  Corosync Cluster Engine, version '1.3.1'
>  Copyright (c) 2006-2009 Red Hat, Inc.
> 
>  It's the version that comes with SLES11-SP1-HA.
> 
> --
>  Sebastian
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
> 
> 
> 