[Pacemaker] do_lrm_control: Failed to sign on to the LRM repeatedly!

Fri Nov 19 02:50:09 EST 2010

On Fri, Nov 19, 2010 at 12:08 AM, Dave Williams
<dave at opensourcesolutions.co.uk> wrote:
> I have a problem with a cluster that won't start up. It is running a 2-node
> failover (master/slave) clustered FTP server using DRBD to replicate the
> filesystem.
>
> Upgraded from 10.04 Lucid to 10.10 Maverick to obtain support for
> upstart resource agents.
> Running:
>  pacemaker 1.0.9.1-2ubuntu4
>  corosync 1.2.1-1ubuntu1
>  cluster-agents 1:1.0.3-3
>
> Before the upgrade it was working reasonably OK (except for detecting
> vsftpd running which I diagnosed as being due to upstart having hijacked
> the lsb compliant sysv startup script and replaced it with its own
> non-compliant version).
>
> Daemon logs show:
>
> crmd: WARN: lrm_signon: can not initiate connection
> crmd: [4963]: WARN: do_lrm_control: Failed to sign on to the LRM 29 (30
> max) time
>
> netstat -anp shows:
> unix 2 [ ] DGRAM 22204 4546/lrmd
>
> which implies at least part of lrmd is running.
> I don't know what this means, but I cannot find any Unix sockets in the
> filesystem.
>
> ps axf shows:
> 25525 ?        Ssl    0:00 /usr/sbin/corosync
> 25532 ?        SLs    0:00  \_ /usr/lib/heartbeat/stonithd
> 25533 ?        S      0:00  \_ /usr/lib/heartbeat/cib
> 25534 ?        Z      0:00  \_ [lrmd] <defunct>
> 25535 ?        S      0:00  \_ /usr/lib/heartbeat/attrd
> 25536 ?        Z      0:00  \_ [pengine] <defunct>
> 25537 ?        S      0:00  \_ /usr/lib/heartbeat/crmd
> 25540 ?        S      0:00  \_ /usr/lib/heartbeat/cib
> 25541 ?        S      0:00  \_ /usr/lib/heartbeat/lrmd
> 25542 ?        S      0:00  \_ /usr/lib/heartbeat/attrd
> 25543 ?        S      0:00  \_ /usr/lib/heartbeat/pengine
> 25547 ?        Z      0:00  \_ [corosync] <defunct>
> 25548 ?        Z      0:00  \_ [corosync] <defunct>
> 25553 ?        Z      0:00  \_ [corosync] <defunct>
> 25555 ?        Z      0:00  \_ [corosync] <defunct>
> 25866 ?        S      0:00  \_ /usr/lib/heartbeat/crmd

This install is seriously sick.
Multiple copies of all our daemons.

If I had to guess, I'd say there were version incompatibilities
between the various cluster packages.
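
To make the duplication concrete, here is a rough sketch that tallies live and defunct instances per daemon from "ps axf"-style text. The function name count_daemons and the SAMPLE excerpt (a few lines copied from the listing above) are only illustrative, and the parsing assumes the PID/TTY/STAT/TIME/COMMAND column layout shown in this thread. It is not part of any cluster tooling.

```python
from collections import Counter

# A few lines copied from the `ps axf` listing quoted in this thread.
SAMPLE = r"""
25525 ?        Ssl    0:00 /usr/sbin/corosync
25533 ?        S      0:00  \_ /usr/lib/heartbeat/cib
25534 ?        Z      0:00  \_ [lrmd] <defunct>
25537 ?        S      0:00  \_ /usr/lib/heartbeat/crmd
25540 ?        S      0:00  \_ /usr/lib/heartbeat/cib
25541 ?        S      0:00  \_ /usr/lib/heartbeat/lrmd
25547 ?        Z      0:00  \_ [corosync] <defunct>
25866 ?        S      0:00  \_ /usr/lib/heartbeat/crmd
"""


def count_daemons(ps_output):
    """Return (live, defunct) Counters keyed by daemon name."""
    live, defunct = Counter(), Counter()
    for line in ps_output.splitlines():
        fields = line.split()
        # Assumed layout: PID TTY STAT TIME [tree markers] COMMAND ...
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        state = fields[2]
        # Drop the tree-drawing markers that `ps f` puts before children.
        command = [f for f in fields[4:] if f not in ("\\_", "|")]
        if not command:
            continue
        # "/usr/lib/heartbeat/lrmd" and "[lrmd]" both become "lrmd".
        name = command[0].rsplit("/", 1)[-1].strip("[]")
        if state.startswith("Z"):
            defunct[name] += 1
        else:
            live[name] += 1
    return live, defunct


live, defunct = count_daemons(SAMPLE)
duplicates = sorted(name for name, n in live.items() if n > 1)
print("live:", dict(live))
print("defunct:", dict(defunct))
print("daemons with more than one live copy:", duplicates)
```

Any daemon reported with more than one live copy, or with a defunct copy next to a live one, matches the pattern in the listing above.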

>
> (This was from another run so the pids differ from above).
>
> crm_mon -1 shows:
>
> ============
> Last updated: Wed Nov 17 00:13:25 2010
> Stack: openais
> Current DC: NONE
> 2 Nodes configured, 2 expected votes
> 2 Resources configured.
> ============
>
> OFFLINE: [ node1 node2 ]
>
> Clearly the "Current DC: NONE" is a symptom that results from lrmd not
> being communicative.
>
> strace analysis shows initial (defunct) lrmd creating
> "/var/run/heartbeat/lrm_cmd_sock" and ..callback_sock
> then being terminated via a SIGTERM kill about 1 second later by the 2nd lrmd
> instance that continues running. This appears to cause the first
> instance to delete the socket.
>
> I haven't followed the source enough yet to understand whether this is
> expected or an erroneous condition, but it appears the missing socket is the
> cause of the error messages. Whether this is why my cluster won't start I am
> not 100% sure.
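
A quick way to confirm whether the socket is really gone while crmd is retrying is to stat the path directly. The sketch below is only an illustration: the helper name socket_status is made up, and the path is the one quoted from the strace output above, so adjust it if your build puts the socket elsewhere.

```python
import os
import stat

# Path taken from the strace output quoted above; it may differ on other builds.
LRM_CMD_SOCK = "/var/run/heartbeat/lrm_cmd_sock"


def socket_status(path):
    """Return 'socket', 'not-a-socket' or 'missing' for the given path."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    return "socket" if stat.S_ISSOCK(mode) else "not-a-socket"


if __name__ == "__main__":
    print(LRM_CMD_SOCK, "->", socket_status(LRM_CMD_SOCK))
```

If this reports "missing" while an lrmd process is still alive, that would be consistent with the first instance removing the socket when it is killed.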
>
> It may be some form of timing condition because I did manage to get the
> stack running once via corosync stops and starts with a random delay in
> between.
> (I note that "/etc/init.d/corosync stop" leaves some processes running!)
>
> Can anyone help me debug and find root cause and a solution?
>
> Thanks