[Pacemaker] Strange behavior when starting up Corosync in a single node setup

Stephan-Frank Henry Frank.Henry at gmx.net
Thu Oct 21 16:46:32 UTC 2010


> Andrew Beekhof
> Mon, 13 Sep 2010 06:25:48 -0700
> 
> Looks like corosync can't talk to itself - ie. it never sees the
> multicast messages it sends out.
> This would result in the pacemaker errors you're seeing.
> 
> Almost always this is a firewall issue :-)
> Perhaps try disabling it completely?

Sorry for the late reply. I have been seeing a lot of strange things and wanted to confirm them before wasting more of your time. I also tried some fixes I found on the net, and naturally had other stuff to do.

Even though I do not always reply immediately or at all, I am always very grateful for everyone's help.

I also upgraded all our libs to the latest versions, including those from the Lenny backports.
I had initially thought it was a problem with our kernel but that did not pan out.
But I'll just stick with the newer versions for now.

The strangest thing is, the behavior is completely random.

Sometimes it just works.
Sometimes it just dies right after the start (could be the race condition mentioned below).
Sometimes I get these:
 crmd: [3086]: WARN: lrm_signon: can not initiate connection
 crmd: [3086]: WARN: do_lrm_control: Failed to sign on to the LRM 2 (30 max) times
 crmd: [3086]: info: ais_dispatch: Membership 92: quorum still lost
Sometimes I get these:
 crmd: [3067]: info: crm_timer_popped: Wait Timer (I_NULL) just popped!
 crmd: [3067]: info: do_cib_control: Could not connect to the CIB service: connection failed
 crmd: [3067]: WARN: do_cib_control: Couldn't complete CIB registration 2 times... pause and retry
Or these:
 crmd: [2649]: notice: Not currently connected.
 crmd: [2649]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
 crmd: [2649]: info: te_connect_stonith: Attempting connection to fencing daemon...
 crmd: [2649]: ERROR: stonithd_signon: Can't initiate connection to stonithd

And I get these all the time.
(all of the above were actually taken from one machine across subsequent reboots, but the same applies to all of them)

Strangest of all, if I stop (and kill) corosync and restart it manually via init.d, it works fine, even without any of the changes mentioned below.
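In other words, roughly this sequence (the killall is only there to make sure nothing survives the stop):
 /etc/init.d/corosync stop
 killall -9 corosync
 /etc/init.d/corosync start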

I have tried a lot of (crazy) stuff:
* different network setups
* different hardware
* created resolv.conf
* inet config
* ntp corrections
* disabled firehol (chmod -x on the script; see the firewall sketch after this list)
* disabled bind9 (same)
* disabled drbd by removing it from the runlevels (not via chmod -x on the script as above)
* changed the runlevels as per this post (moved mine to rcS.d/S98corosync):
  http://oss.clusterlabs.org/pipermail/pacemaker/2010-February/005010.html
* upgraded corosync to 1.2.1-2 due to this race condition bug and its subsequent fix:
  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=596694
* removed stuff from the root tag in the cib.xml that I thought might be a problem
  I have a template cib.xml that I use and I removed everything that crm_verify did not complain about (see the crm_verify sketch after this list).
* sacrificed coffee to the IT gods.
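
Re the firewall: for completeness, this is roughly the traffic that has to get through for corosync to see its own multicast (the ports assume the default mcastport 5405 from corosync.conf, so treat it as a sketch):
 # let corosync's own multicast traffic back in
 iptables -A INPUT -m pkttype --pkt-type multicast -j ACCEPT
 iptables -A INPUT -p udp --dport 5404:5405 -j ACCEPT
 # then check whether the ring actually forms
 corosync-cfgtool -s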
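
The cib.xml trimming was basically trial and error against crm_verify, roughly like this (the template path here is just an example):
 # check the template before copying it into place
 crm_verify -V -x /root/cib-template.xml
 # or check the running cluster once it is up
 crm_verify -L -V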

All of this was on different machines, with nearly the same outcome everywhere. The changes were deliberately not all made on the same machine, so that I could confirm a fix had I ever found one.
I also have a proof-of-concept machine with just the net-install, the latest kernel from backports and the HA setup, and it shows similar behavior.

Summary:
Debian Lenny 64bit
linux-image-2.6.33.3

Packages:
(default)
corosync         1.2.1-1~bpo50+1
libcorosync4     1.2.1-1~bpo50+1
(updated)
corosync         1.2.1-2
libcorosync4     1.2.1-2

cluster-glue     1.0.6-1~bpo50+1
libcluster-glue  1.0.6-1~bpo50+1
pacemaker        1.0.9.1+hg15626-1~bpo50+1
libheartbeat2    1:3.0.3-2~bpo50+1
drbd8-utils      2:8.3.7-1~bpo50+1

Anything else you'd need?

thanks again.

Frank
