[Pacemaker] lrmadmin -C blocks on subsequent invocations

Mon Nov 22 15:55:15 EST 2010

In an (increasingly desperate) attempt to get a stack that works with
upstart on Ubuntu, I have recompiled from source (as per
http://www.clusterlabs.org/wiki/Install#From_Source) on a clean Maverick
64-bit server.

When running lrmadmin -C to list classes the first time, it comes back
immediately with the expected list:
root@node1:/home# lrmadmin -C
There are 5 RA classes supported:
lsb
ocf
stonith
upstart
heartbeat

All subsequent attempts hang and never come back (you have to kill them
with Ctrl-C). This is repeatable on all the machines I have tried it on.
A reboot appears to be the only cure, as stopping corosync baulks at:
Waiting for corosync services to unload:.........

Is this a related fault or something different? I have seen it before on
other builds and seen posts that appear to report it.

Anyway, strace suggests that lrmadmin is stuck on
/var/run/heartbeat/lrm_cmd_sock: its reads report "Resource temporarily
unavailable", and lrmd never responds to the outbound message:

17:43:41.328500 connect(3, {sa_family=AF_FILE,
path="/var/run/heartbeat/lrm_cmd_sock"}, 110) = 0
17:43:41.328572 getsockopt(3, SOL_SOCKET, SO_PEERCRED,
"\t\4\0\0\0\0\0\0\0\0\0\0", [12]) = 0
17:43:41.328788 getegid()               = 0
17:43:41.328846 getuid()                = 0
17:43:41.328970 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
17:43:41.329050 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
17:43:41.329154 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
17:43:41.329202 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
17:43:41.329263 sendto(3,
"F\0\0\0\315\253\0\0>>>\nlrm_t=reg\nlrm_app=lr"..., 78,
MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 78
17:43:41.329337 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
17:43:41.329380 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
17:43:41.329420 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
17:43:41.329458 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
17:43:41.329497 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
17:43:41.329535 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
17:43:41.329574 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
17:43:41.329613 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
17:43:41.329651 poll([{fd=3, events=POLLIN}], 1, -1 <unfinished ...>
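
For anyone wanting to reproduce the pattern in the trace without touching
lrmd, here is a minimal Python sketch. It assumes nothing about the lrmd
wire protocol: the payload, the socket path and the function names are all
hypothetical stand-ins. It only shows that connect() and send can succeed
against a unix socket whose owner never services it, and that a finite poll
timeout turns the endless wait from the final poll(..., -1) into a
detectable failure.

```python
import os
import select
import socket
import tempfile


def wait_for_reply(path, payload, timeout_s):
    """Connect to a unix stream socket, send payload, then wait up to
    timeout_s seconds for the peer to send anything back.
    Returns True if data became readable, False on timeout."""
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        client.connect(path)        # succeeds while the listen backlog has room
        client.sendall(payload)     # succeeds: the kernel buffers the bytes
        poller = select.poll()
        poller.register(client.fileno(), select.POLLIN)
        # Finite timeout in milliseconds, unlike the -1 seen in the trace.
        return bool(poller.poll(int(timeout_s * 1000)))
    finally:
        client.close()


def demo_silent_listener():
    """Stand-in for an unresponsive daemon: a socket that listens but
    never accepts or replies. The path is a throwaway temp file."""
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "demo_sock")
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(1)
    try:
        # b"hello" is a placeholder, not a real lrmd message.
        return wait_for_reply(path, b"hello", 0.2)
    finally:
        server.close()
        os.unlink(path)
        os.rmdir(tmpdir)
```

Calling demo_silent_listener() gives the client the same view as the hung
lrmadmin: the connection is established and the request is sent, but
nothing ever comes back.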

The lrmd process is still alive and there is nothing logged in
/var/log/daemon.log. Its strace implies it never even saw the request on
the socket. The process still has 3 file handles open on the socket:
root@node1:~# lsof /var/run/heartbeat/lrm_cmd_sock
COMMAND  PID USER   FD   TYPE             DEVICE SIZE/OFF  NODE NAME
lrmd    1420 root    3u  unix 0xffff88001e011040      0t0  8732 /var/run/heartbeat/lrm_cmd_sock
lrmd    1420 root    9u  unix 0xffff88001e0b4d00      0t0  8782 /var/run/heartbeat/lrm_cmd_sock
lrmd    1420 root   11u  unix 0xffff88001e1a9d40      0t0 10211 /var/run/heartbeat/lrm_cmd_sock
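
One way to dig further into those handles is to separate the listening
socket from any accepted connections. The sketch below is an assumption
about how that could be done, not something from the trace: it parses the
text of /proc/net/unix on Linux and treats rows with the __SO_ACCEPTCON
flag (0x10000) as listeners. The function name is hypothetical.

```python
def summarize_unix_sockets(table_text, sock_path):
    """Count the /proc/net/unix rows bound to sock_path.
    Returns (listening, other), where 'other' covers accepted or
    otherwise non-listening sockets carrying the same path."""
    listening = 0
    other = 0
    for line in table_text.splitlines()[1:]:    # skip the header row
        fields = line.split()
        # Columns: Num RefCount Protocol Flags Type St Inode [Path]
        if len(fields) < 8 or fields[7] != sock_path:
            continue
        flags = int(fields[3], 16)
        if flags & 0x10000:                     # __SO_ACCEPTCON
            listening += 1
        else:
            other += 1
    return listening, other
```

To use it, read /proc/net/unix as root and pass the contents along with
"/var/run/heartbeat/lrm_cmd_sock". A count of one listener plus connections
that never go away would point at clients lrmd has accepted but is no
longer servicing.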

A good strace (i.e. lrmadmin -C after a reboot) starts identically to the
strace above but receives a response from lrmd:
...
20:12:48.774239 poll([{fd=3, events=POLLIN}], 1, -1) = 1 ([{fd=3,
revents=POLLIN}])
20:12:48.774603 recvfrom(3,
" \0\0\0\315\253\0\0>>>\nlrm_t=return\nlrm_ret"..., 4048, MSG_DONTWAIT,
NULL, NULL) = 40
20:12:48.774661 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
20:12:48.774709 recvfrom(3, 0x1049e98, 4008, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
20:12:48.774756 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
20:12:48.774804 recvfrom(3, 0x1049e98, 4008, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
20:12:48.774851 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
20:12:48.774898 recvfrom(3, 0x1049e98, 4008, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
20:12:48.774945 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
20:12:48.775161 recvfrom(3, 0x1049e98, 4008, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
20:12:48.775210 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
20:12:48.775257 recvfrom(3, 0x1049e98, 4008, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
20:12:48.775304 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
20:12:48.775444 socket(PF_FILE, SOCK_STREAM, 0) = 4
20:12:48.775610 fcntl(4, F_GETFL)       = 0x2 (flags O_RDWR)
20:12:48.775686 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
20:12:48.775841 connect(4, {sa_family=AF_FILE,
path="/var/run/heartbeat/lrm_callback_sock"}, 110) = 0
20:12:48.775907 getsockopt(4, SOL_SOCKET, SO_PEERCRED,
"\214\5\0\0\0\0\0\0\0\0\0\0", [12]) = 0
...

Other commands like "crm configure verify" exhibit the same "hang",
although I have not traced these. I guess they must use lrmd too.

I haven't tried recompiling without upstart support, as I specifically
need that, but I have a suspicion it might be related. Maybe it has
something to do with dbus, although a "good" command seems to complete
without obvious error.

Versions are
Cluster-Resource-Agents-051972b5cfd
Pacemaker-1-0-b2e39d318fda
Reusable-Cluster-Components-8658bcdd4511
flatiron - not sure but downloaded Friday 19th

Has anybody seen this behaviour, or does anyone know how best to debug it
further?

Thanks
Dave