[Pacemaker] lrmadmin -C blocks on subsequent invocations

Mon Nov 22 12:28:41 EST 2010

In an (increasingly desperate) attempt to get a stack that works with
upstart on Ubuntu, I have recompiled from source (as per
http://www.clusterlabs.org/wiki/Install#From_Source) on a clean Maverick
64-bit server.

When running lrmadmin -C to list classes, the first time it comes back
immediately with the expected list:
root@node1:/home# lrmadmin -C
There are 5 RA classes supported:
lsb
ocf
stonith
upstart
heartbeat

All subsequent attempts block and never come back (you have to kill
them with Ctrl-C). This is repeatable on all the machines I have tried it on.
A reboot appears to be the only cure, as corosync stop
baulks at
Waiting for corosync services to unload:.........
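
For anyone trying to reproduce this without losing a terminal each time, here
is a minimal sketch that runs the command under a timeout and kills it if it
blocks. It assumes Python 3 is available on the node; the helper name
run_with_timeout and the 5 second limit are my own choices for illustration,
not part of any cluster tooling.

```python
import subprocess


def run_with_timeout(cmd, seconds=5.0):
    """Run cmd and return (finished, output).

    finished is False when the command had to be killed because it did
    not exit within the given number of seconds.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=seconds,
            text=True,
        )
        return True, result.stdout
    except subprocess.TimeoutExpired as exc:
        # Partial output may be None or bytes depending on the Python version.
        partial = exc.stdout or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return False, partial


if __name__ == "__main__":
    # Hypothetical reproduction loop: on an affected node the first call
    # would be expected to finish and the second to time out.
    for attempt in (1, 2):
        finished, output = run_with_timeout(["lrmadmin", "-C"])
        print("attempt", attempt, "finished" if finished else "timed out")
        print(output)
```

This only stops the client from hanging forever; it does nothing about
whatever state lrmd is left in.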

strace suggests that lrmadmin is stuck on
/var/run/heartbeat/lrm_cmd_sock, reporting "Resource temporarily
unavailable" as per below:

17:43:41.328500 connect(3, {sa_family=AF_FILE,
path="/var/run/heartbeat/lrm_cmd_sock"}, 110) = 0
17:43:41.328572 getsockopt(3, SOL_SOCKET, SO_PEERCRED,
"\t\4\0\0\0\0\0\0\0\0\0\0", [12]) = 0
17:43:41.328788 getegid()               = 0
17:43:41.328846 getuid()                = 0
17:43:41.328970 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
17:43:41.329050 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
17:43:41.329154 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
17:43:41.329202 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
17:43:41.329263 sendto(3,
"F\0\0\0\315\253\0\0>>>\nlrm_t=reg\nlrm_app=lr"..., 78,
MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 78
17:43:41.329337 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
17:43:41.329380 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
17:43:41.329420 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
17:43:41.329458 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
17:43:41.329497 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
17:43:41.329535 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
17:43:41.329574 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
17:43:41.329613 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
17:43:41.329651 poll([{fd=3, events=POLLIN}], 1, -1 <unfinished ...>

The lrmd process is still alive and nothing is logged in
/var/log/daemon.log.
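
To separate "nothing is listening on the socket" from "something accepts the
connection but never answers", a small probe along these lines might help.
It is only a sketch: it assumes Python 3, the function name probe_unix_socket
is made up for illustration, and it does not speak the lrm protocol at all,
it just attempts a connect with a timeout.

```python
import os
import socket


def probe_unix_socket(path, timeout=2.0):
    """Try to connect to a Unix stream socket.

    Returns 'missing', 'refused', 'timeout', 'accepted',
    or 'error: <reason>' for any other failure.
    """
    if not os.path.exists(path):
        return "missing"
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
    except ConnectionRefusedError:
        # The socket file exists but nothing is listening on it.
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return "error: %s" % exc.strerror
    finally:
        sock.close()
    return "accepted"


if __name__ == "__main__":
    # Path taken from the strace output above.
    print(probe_unix_socket("/var/run/heartbeat/lrm_cmd_sock"))
```

Since connect() returns 0 in the trace, I would expect this to report
"accepted" on an affected node, which would point at lrmd never replying
to the registration message rather than at a dead listener.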

Other commands like "crm configure verify" do the same thing, although
I have not traced these.
I haven't tried recompiling without upstart support, as I specifically
need that, but I have a suspicion it might be related. Maybe it's dbus
related.

Versions are
Cluster-Resource-Agents-051972b5cfd
Pacemaker-1-0-b2e39d318fda
Reusable-Cluster-Components-8658bcdd4511
flatiron - not sure but downloaded Friday 19th

Anybody seen this characteristic or know how to debug further?

Thanks

Dave