[Pacemaker] getting started - crm hangs when adding resources, even "crm ra classes" hangs

Tue Mar 13 14:21:46 EDT 2012

----- Original Message -----
> From: "Phillip Frost" <phil at macprofessionals.com>
> To: pacemaker at oss.clusterlabs.org
> Sent: Tuesday, March 13, 2012 1:21:00 PM
> Subject: [Pacemaker] getting started - crm hangs when adding resources, even "crm ra classes" hangs
> 
> I'm trying to set up pacemaker for the first time, following the
> instructions in clusters from scratch, on Debian squeeze, using
> pacemaker and corosync from squeeze-backports. I seem to have gotten
> as far as getting two nodes in the cluster:
> 
> # crm status
> ============
> Last updated: Tue Mar 13 13:02:37 2012
> Last change: Tue Mar 13 12:50:25 2012 via cibadmin on xenhost02
> Stack: openais
> Current DC: xenhost02 - partition with quorum
> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
> 2 Nodes configured, 2 expected votes
> 0 Resources configured.
> ============
> 
> Online: [ xenhost02 xen01 ]
> 
> However, that's as far as I can get. The next step in clusters from
> scratch is configuring an IP address resource. Running this command
> seems to never terminate, with no output:
> 
> crm configure primitive ClusterIP ocf:heartbeat:IPaddr2 params
> ip=192.168.122.101 cidr_netmask=32 op monitor interval=30s
> 
> more interestingly, even "crm ra classes" never terminates, again
> with no output, and nothing appended to syslog.
> 
> I've also noticed that if I attempt to stop pacemaker
> (/etc/init.d/pacemaker stop), it doesn't stop. I get this in syslog:
> 
> Mar 13 12:35:52 xen01 crmd: [9937]: info: crm_shutdown: Requesting
> shutdown
> Mar 13 12:35:52 xen01 crmd: [9937]: notice: crm_shutdown: Forcing
> shutdown in: 1200000ms
> Mar 13 12:35:52 xen01 crmd: [9937]: info: do_shutdown_req: Sending
> shutdown request to DC: xenhost02
> Mar 13 12:35:52 xen01 corosync[9897]:   [TOTEM ] Retransmit List: 65
> Mar 13 12:35:52 xen01 corosync[9897]:   [TOTEM ] Retransmit List: 65
> Mar 13 12:35:52 xen01 corosync[9897]:   [TOTEM ] Retransmit List: 65
> [repeating, several times per second]
> 

You don't have anything in the log from lrmd, do you?

In Ubuntu 10.04 there is a bug in glib that causes hangs on shutdown as well as on some crm commands. Patches are out to fix it for Ubuntu specifically (https://bugs.launchpad.net/ubuntu/oneiric/+source/cluster-glue/+bug/821732). Not sure whether the bug affects Debian too.

Here is an excerpt from the above bug to test for the problem.  I think the steps would work for Debian too:
Open a few client->server connections:
        lrmadmin -C ; lrmadmin -C ; lrmadmin -C ; lrmadmin -C
Check number of open sockets:
        lsof -f | grep lrm_callback_sock | wc -l
Correct value is 2, but it will be 6 or 8. There's a socket leak.
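
The check above can be wrapped in a small script so the count is compared against the expected value automatically. This is only a rough sketch: the function names count_lrm_sockets and check_leak are made up for illustration, and the threshold of 2 is simply the "correct value" quoted from the bug report.

```shell
#!/bin/sh
# Hypothetical helpers (not part of cluster-glue or pacemaker).

# Count the lines on stdin that mention lrm_callback_sock.
count_lrm_sockets() {
    grep -c lrm_callback_sock
}

# Compare a socket count against the expected value of 2
# (the value quoted from the bug report).
check_leak() {
    count=$1
    if [ "$count" -gt 2 ]; then
        echo "possible socket leak: $count lrm_callback_sock entries (expected 2)"
        return 1
    fi
    echo "ok: $count lrm_callback_sock entries"
}

# On a cluster node with lrmd running (likely as root), something like:
#   lrmadmin -C ; lrmadmin -C ; lrmadmin -C ; lrmadmin -C
#   check_leak "$(lsof -f | count_lrm_sockets)"
```

If check_leak reports more than 2 entries after the lrmadmin calls, that would match the leak described in the bug.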

Here is the patch to glib2.0 that was needed to fix it:
https://mail.gnome.org/archives/commits-list/2010-November/msg01816.html

HTH

Jake

> I can only guess that some lower-level communication between the
> nodes is not working. The issue is I have no idea what the lower
> levels are, or how to troubleshoot them. I'm not even really sure
> what information I should supply to help with troubleshooting. Any
> guidance would be much appreciated.
> 