[Pacemaker] problem with pacemaker/corosync on CentOS 6.3

Tue Jul 24 09:27:14 EDT 2012

Hi,

I'm glad to report that after restarting all corosync and pacemaker services, the cluster is back to normal operation. A manual failover works fine and everything shifts over smoothly.

Thanks to everyone for their support!

Kind regards

fatcharly

-------- Original Message --------
> Date: Tue, 24 Jul 2012 15:13:39 +0200
> From: fatcharly at gmx.de
> To: Jake Smith <jsmith at argotec.com>, The Pacemaker cluster resource manager <pacemaker at oss.clusterlabs.org>
> Subject: Re: [Pacemaker] problem with pacemaker/corosync  on CentOS 6.3

> Hi,
> 
> Here are the results of the corosync status checks. I can't find a problem there:
> 
> pilotpound:
> 
> [root@pilotpound ~]# corosync-cfgtool -s
> Printing ring status.
> Local node ID 425699520
> RING ID 0
>         id      = 192.168.95.25
>         status  = ring 0 active with no faults
> RING ID 1
>         id      = 192.168.20.245
>         status  = ring 1 active with no faults
> [root@pilotpound ~]# corosync-objctl | grep member
> runtime.totem.pg.mrp.srp.members.425699520.ip=r(0) ip(192.168.95.25) r(1) ip(192.168.20.245)
> runtime.totem.pg.mrp.srp.members.425699520.join_count=1
> runtime.totem.pg.mrp.srp.members.425699520.status=joined
> runtime.totem.pg.mrp.srp.members.442476736.ip=r(0) ip(192.168.95.26) r(1) ip(192.168.20.246)
> runtime.totem.pg.mrp.srp.members.442476736.join_count=1
> runtime.totem.pg.mrp.srp.members.442476736.status=joined
> 
> 
> powerpound:
> 
> [root@powerpound ~]# corosync-cfgtool -s
> Printing ring status.
> Local node ID 442476736
> RING ID 0
>         id      = 192.168.95.26
>         status  = ring 0 active with no faults
> RING ID 1
>         id      = 192.168.20.246
>         status  = ring 1 active with no faults
> [root@powerpound ~]# corosync-objctl | grep member
> runtime.totem.pg.mrp.srp.members.442476736.ip=r(0) ip(192.168.95.26) r(1) ip(192.168.20.246)
> runtime.totem.pg.mrp.srp.members.442476736.join_count=1
> runtime.totem.pg.mrp.srp.members.442476736.status=joined
> runtime.totem.pg.mrp.srp.members.425699520.ip=r(0) ip(192.168.95.25) r(1) ip(192.168.20.245)
> runtime.totem.pg.mrp.srp.members.425699520.join_count=5
> runtime.totem.pg.mrp.srp.members.425699520.status=joined
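
The membership output above can also be checked mechanically instead of by eye. The following is a minimal sketch, assuming the `status=` key format shown in this thread; the function name `unjoined_members` is made up for illustration and is not part of corosync.

```shell
# Reads `corosync-objctl | grep member` output on stdin and prints the
# node ID of every member whose status is anything other than "joined".
# Exit status: 0 if every listed member is joined, 1 otherwise.
unjoined_members() {
    awk -F'[.=]' '
        # Match only the per-member status keys; ip and join_count are skipped.
        /^runtime\.totem\.pg\.mrp\.srp\.members\.[0-9]+\.status=/ {
            # Field 7 is the node ID, the last field is the status value.
            if ($NF != "joined") { print $7; bad = 1 }
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

Possible usage on each node: `corosync-objctl | grep member | unjoined_members`. For the output quoted above it would print nothing, since both nodes report `joined`, which matches the observation that corosync itself looked healthy.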
> 
> So I think I'll have to swallow the bitter pill and restart the whole
> cluster.
> 
> I will report back with the result.
> 
> Kind regards
> 
> fatcharly
> 
>  
> -------- Original Message --------
> > Date: Fri, 20 Jul 2012 12:21:47 -0400 (EDT)
> > From: Jake Smith <jsmith at argotec.com>
> > To: The Pacemaker cluster resource manager <pacemaker at oss.clusterlabs.org>
> > Subject: Re: [Pacemaker] problem with pacemaker/corosync  on CentOS 6.3
> 
> > 
> > ----- Original Message -----
> > > From: fatcharly at gmx.de
> > > To: "Jake Smith" <jsmith at argotec.com>, "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> > > Sent: Friday, July 20, 2012 11:50:52 AM
> > > Subject: Re: [Pacemaker] problem with pacemaker/corosync  on CentOS 6.3
> > > 
> > > Hi Jake,
> > > 
> > > I erased the files as mentioned and started the services. This is
> > > what I get from crm_mon on pilotpound:
> > > 
> > > ============
> > > Last updated: Fri Jul 20 17:45:58 2012
> > > Last change:
> > > Current DC: NONE
> > > 0 Nodes configured, unknown expected votes
> > > 0 Resources configured.
> > > ============
> > > 
> > > 
> > > Looks like the system didn't join the cluster.
> > > 
> > > Any suggestions are welcome.
> > 
> > Oh, it may be worth checking the corosync membership to see what it says now:
> > http://www.hastexo.com/resources/hints-and-kinks/checking-corosync-cluster-membership
> > 
> > > 
> > > Kind regards
> > > 
> > > fatcharly
> > > 
> > > -------- Original Message --------
> > > > Date: Fri, 20 Jul 2012 10:49:15 -0400 (EDT)
> > > > From: Jake Smith <jsmith at argotec.com>
> > > > To: The Pacemaker cluster resource manager <pacemaker at oss.clusterlabs.org>
> > > > Subject: Re: [Pacemaker] problem with pacemaker/corosync  on CentOS 6.3
> > > 
> > > > 
> > > > ----- Original Message -----
> > > > > From: fatcharly at gmx.de
> > > > > To: pacemaker at oss.clusterlabs.org
> > > > > Sent: Friday, July 20, 2012 6:08:45 AM
> > > > > > Subject: [Pacemaker] problem with pacemaker/corosync  on CentOS 6.3
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > > I'm using a pacemaker+corosync setup to run a pound-based
> > > > > > load balancer. After an update to CentOS 6.3, the node status
> > > > > > no longer matches between the nodes: crm_mon on one node shows
> > > > > > everything as fine, while on the other node everything is
> > > > > > offline. Everything was fine on CentOS 6.2.
> > > > > 
> > > > > Node powerpound:
> > > > > 
> > > > > ============
> > > > > Last updated: Fri Jul 20 12:04:29 2012
> > > > > > Last change: Thu Jul 19 17:58:31 2012 via crm_attribute on pilotpound
> > > > > Stack: openais
> > > > > Current DC: powerpound - partition with quorum
> > > > > Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
> > > > > 2 Nodes configured, 2 expected votes
> > > > > 7 Resources configured.
> > > > > ============
> > > > > 
> > > > > Online: [ powerpound pilotpound ]
> > > > > 
> > > > > HA_IP_1 (ocf::heartbeat:IPaddr2):       Started powerpound
> > > > > HA_IP_2 (ocf::heartbeat:IPaddr2):       Started powerpound
> > > > > HA_IP_3 (ocf::heartbeat:IPaddr2):       Started powerpound
> > > > > HA_IP_4 (ocf::heartbeat:IPaddr2):       Started powerpound
> > > > > HA_IP_5 (ocf::heartbeat:IPaddr2):       Started powerpound
> > > > >  Clone Set: pingclone [ping-gateway]
> > > > >      Started: [ pilotpound powerpound ]
> > > > > 
> > > > > 
> > > > > Node pilotpound:
> > > > > 
> > > > > ============
> > > > > Last updated: Fri Jul 20 12:04:32 2012
> > > > > > Last change: Thu Jul 19 17:58:17 2012 via crm_attribute on pilotpound
> > > > > Stack: openais
> > > > > Current DC: NONE
> > > > > 2 Nodes configured, 2 expected votes
> > > > > 7 Resources configured.
> > > > > ============
> > > > > 
> > > > > OFFLINE: [ powerpound pilotpound ]
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > > From /var/log/messages on pilotpound:
> > > > > > 
> > > > > > Jul 20 12:06:12 pilotpound cib[24755]:  warning: cib_peer_callback: Discarding cib_apply_diff message (35909) from powerpound: not in our membership
> > > > > > Jul 20 12:06:12 pilotpound cib[24755]:  warning: cib_peer_callback: Discarding cib_apply_diff message (35910) from powerpound: not in our membership
> > > > > 
> > > > > 
> > > > > 
> > > > > > How could this happen, and what can I do to solve this problem?
> > > > 
> > > > Pretty sure it had nothing to do with the upgrade - I had this the
> > > > other day on Ubuntu 12.04 after a reboot of both nodes. I believe a
> > > > couple of experts called it a "transient" bug. See:
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=820821
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=5040
> > > > 
> > > > > 
> > > > > Any suggestions are welcome
> > > > 
> > > > I fixed it by stopping/killing pacemaker/corosync on the offending
> > > > node (pilotpound), then cleared these files out on the same node:
> > > > rm /var/lib/heartbeat/crm/cib*
> > > > rm /var/lib/pengine/*
> > > > 
> > > > Then I restarted corosync/pacemaker and the node rejoined fine.
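
The steps above can be sketched as a small script for the out-of-sync node. This is only an illustration, assuming both daemons are managed through `service` init scripts named `corosync` and `pacemaker` (as on the CentOS 6 setup in this thread); the helper names `run` and `reset_node_state` and the `DRY_RUN` switch are made up here, and `rm -f` is used instead of plain `rm` so that already missing files do not abort the run.

```shell
# Prints the command when DRY_RUN=1 (the default), runs it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Stop the stack, drop the local CIB and policy engine files, start again.
# Intended for the offending node only, run as root.
reset_node_state() {
    run service pacemaker stop
    run service corosync stop
    # Local cluster state; the node should fetch the CIB from its peer on rejoin.
    run rm -f /var/lib/heartbeat/crm/cib*
    run rm -f /var/lib/pengine/*
    run service corosync start
    run service pacemaker start
}
```

Calling `reset_node_state` as-is only prints the commands for review; `DRY_RUN=0 reset_node_state` would actually execute them. Pacemaker is stopped before corosync and started after it, since it depends on corosync for membership and messaging.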
> > > > 
> > > > HTH
> > > > 
> > > > Jake
> > > > 
> > > > _______________________________________________
> > > > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > > > 
> > > > Project Home: http://www.clusterlabs.org
> > > > Getting started:
> > > > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > > Bugs: http://bugs.clusterlabs.org
> > > 
> > > 
> > 