[Pacemaker] Strange behavior when starting up Corosync in a single node setup
Andrew Beekhof
andrew at beekhof.net
Wed Oct 27 06:31:53 EDT 2010
On Thu, Oct 21, 2010 at 6:46 PM, Stephan-Frank Henry
<Frank.Henry at gmx.net> wrote:
>> Andrew Beekhof
>> Mon, 13 Sep 2010 06:25:48 -0700
>>
>> Looks like corosync can't talk to itself - ie. it never sees the
>> multicast messages it sends out.
>> This would result in the pacemaker errors you're seeing.
>>
>> Almost always this is a firewall issue :-)
>> Perhaps try disabling it completely?
>
> Sorry for the late reply. I have been seeing a lot of strange behavior and wanted to confirm it before wasting more of your time. I also tried some fixes I had found on the net, and naturally had other things to do.
>
> Even though I do not always reply immediately or at all, I am always very grateful for everyone's help.
>
> I also upgraded all our libs to the latest versions, including those from the Lenny backports.
> I had initially thought it was a problem with our kernel but that did not pan out.
> But I'll just stick with the newer versions for now.
>
> The strangest thing is, the behavior is completely random.
>
> Sometimes it just works.
> Sometimes it just dies after the start (could be the race condition mentioned below)
> Sometimes I get these:
> crmd: [3086]: WARN: lrm_signon: can not initiate connection
> crmd: [3086]: WARN: do_lrm_control: Failed to sign on to the LRM 2 (30 max) times
> crmd: [3086]: info: ais_dispatch: Membership 92: quorum still lost
> Sometimes I get these:
> crmd: [3067]: info: crm_timer_popped: Wait Timer (I_NULL) just popped!
> crmd: [3067]: info: do_cib_control: Could not connect to the CIB service: connection failed
> crmd: [3067]: WARN: do_cib_control: Couldn't complete CIB registration 2 times... pause and retry
Warnings are just that: warnings.
In this case a couple of processes are taking a little while to start.
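
To separate the warnings from real errors in a log like the one quoted above, a small script can tally entries by severity. This is only a sketch: the regular expression is an assumption about the line layout, inferred from the quoted lines (daemon name, PID in brackets, level, message), and count_levels is a made-up helper name, not part of Pacemaker.

```python
import re
from collections import Counter

# Assumed layout, inferred from the quoted log lines:
#   <daemon>: [<pid>]: <LEVEL>: <message>
LOG_RE = re.compile(
    r"^(?P<daemon>\w+): \[(?P<pid>\d+)\]: (?P<level>\w+): (?P<msg>.*)$"
)

# Lines copied from the quoted log above.
SAMPLE = [
    "crmd: [3086]: WARN: lrm_signon: can not initiate connection",
    "crmd: [3086]: WARN: do_lrm_control: Failed to sign on to the LRM 2 (30 max) times",
    "crmd: [3086]: info: ais_dispatch: Membership 92: quorum still lost",
    "crmd: [3067]: info: do_cib_control: Could not connect to the CIB service: connection failed",
    "crmd: [3067]: WARN: do_cib_control: Couldn't complete CIB registration 2 times... pause and retry",
]


def count_levels(lines):
    """Count log entries per severity level; lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.match(line.strip())
        if match:
            counts[match.group("level")] += 1
    return counts
```

Calling count_levels(SAMPLE) reports three WARN and two info entries and no ERROR, which fits the reading that these particular messages are startup retries rather than failures.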
> Or these:
> crmd: [2649]: notice: Not currently connected.
> crmd: [2649]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
> crmd: [2649]: info: te_connect_stonith: Attempting connection to fencing daemon...
> crmd: [2649]: ERROR: stonithd_signon: Can't initiate connection to stonithd
What does "ps axf" show?
I'm guessing you'll see one or more zombie processes where stonithd is
supposed to be.
Which would mean you're hitting this:
http://theclusterguy.clusterlabs.org/post/907043024/introducing-the-pacemaker-master-control-process-for
Grab 1.1.4 and use option 2.
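
If reading the "ps axf" output by eye is tedious, the zombie check can be scripted. The following is a minimal sketch, assuming the usual procps column order for "ps axf" (PID, TTY, STAT, TIME, COMMAND) and that a zombie shows a STAT value starting with "Z"; find_zombies and run_ps are hypothetical helper names.

```python
import subprocess


def find_zombies(ps_output):
    """Return (pid, command) pairs for rows whose STAT column starts with 'Z'.

    Assumes the column order PID TTY STAT TIME COMMAND.
    """
    zombies = []
    for line in ps_output.splitlines():
        fields = line.split(None, 4)
        # Skip the header row and anything that is too short to be a process row.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        pid, _tty, stat, _time, command = fields
        if stat.startswith("Z"):
            zombies.append((int(pid), command.strip()))
    return zombies


def run_ps():
    """Run 'ps axf' and return its output as text."""
    result = subprocess.run(
        ["ps", "axf"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Running find_zombies(run_ps()) on the affected node and looking for stonithd among the returned commands would confirm or rule out the guess above.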
> And I get these all the time.
> (all of the above were taken from one machine across successive reboots, but the same applies to all machines)
>
> Strangest of all, if I stop (and kill) corosync and restart it via init.d manually, it works fine.
> Even without any of the changes mentioned below.
>
> I have tried a lot of (crazy) stuff:
> * different network setups
> * different hardware
> * created resolv.conf
> * inet config
> * ntp corrections
> * disabled firehol (chmod -x on the script)
> * disabled bind9 (same)
> * removed drbd from the runlevels (rather than chmod -x on the script, as above)
> * changed runlevels as per this post (moved mine to rcS.d/S98corosync):
> http://oss.clusterlabs.org/pipermail/pacemaker/2010-February/005010.html
> * upgraded corosync to 1.2.1-2 because of this race condition bug and its subsequent fix:
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=596694
> * removed things from the root tag in the cib.xml that I thought might be a problem
> I have a template cib.xml that I use, and I removed everything whose absence crm_verify did not complain about.
> * sacrificed coffee to the IT gods.
>
> All on different machines but all with nearly the same outcome. Changes were not all done on the same machines to confirm a fix had I ever found it.
> I also have a proof-of-concept machine with just the net-install+latest kernel from backports and the HA setup and it is showing similar behavior.
>
> Summary:
> Debian Lenny 64bit
> linux-image-2.6.33.3
>
> Packages:
> (default)
> corosync 1.2.1-1~bpo50+1
> libcorosync4 1.2.1-1~bpo50+1
> (updated)
> corosync 1.2.1-2
> libcorosync4 1.2.1-2
>
> cluster-glue 1.0.6-1~bpo50+1
> libcluster-glue 1.0.6-1~bpo50+1
> pacemaker 1.0.9.1+hg15626-1~bpo50+1
> libheartbeat2 1:3.0.3-2~bpo50+1
> drbd8-utils 2:8.3.7-1~bpo50+1
>
> Anything else you'd need?
>
> thanks again.
>
> Frank