If I remember correctly, this is an old bug that has already been fixed.<br><br><div class="gmail_quote">2012/12/7 Piotr Jewiec <span dir="ltr"><<a href="mailto:piotr@jewiec.net" target="_blank">piotr@jewiec.net</a>></span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hi,<br>

<br>

I have a corosync/pacemaker cluster running on Ubuntu 10.04.2. The following errors are being appended to the syslog:<br>

<br>

Dec  6 20:44:46 filer-1 crmd: [2970]: ERROR: socket_client_channel_new: socket: Too many open files<br>

Dec  6 20:44:46 filer-1 crmd: [2970]: ERROR: init_client_ipc_comms_nodispatch: Could not access channel on: /var/run/crm/pengine<br>

Dec  6 20:44:46 filer-1 crmd: [2970]: WARN: do_pe_control: Setup of client connection failed, not adding channel to mainloop<br>

Dec  6 20:44:46 filer-1 crmd: [2970]: WARN: do_log: FSA: Input I_FAIL from do_pe_control() received in state S_INTEGRATION<br>

Dec  6 20:44:46 filer-1 crmd: [2970]: info: do_dc_join_offer_all: join-24: Waiting on 2 outstanding join acks<br>

Dec  6 20:44:46 filer-1 crmd: [2970]: info: do_dc_takeover: Taking over DC status for this partition<br>

<br>

<br>

root@filer-1:~# lsof -p `pidof crmd` | grep socket | wc -l<br>

1019<br>

<br>

root@filer-1:~# cat /proc/2970/limits | grep 'open files'<br>

Max open files            1024                 1024                 files<br>

<br>
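The same check can be scripted. Below is a minimal sketch, assuming a Linux /proc filesystem; the helper names (parse_open_files_limit, count_open_fds, fd_headroom) are made up for illustration and are not part of pacemaker or any library. Note that /proc/PID/fd lists every descriptor, not only the sockets that the lsof | grep socket pipeline above counts.<br>

```python
import os
import re


def parse_open_files_limit(limits_text):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits.

    A value of 'unlimited' is returned as None.
    """
    for line in limits_text.splitlines():
        m = re.match(r"Max open files\s+(\S+)\s+(\S+)", line)
        if m:
            soft, hard = (None if v == "unlimited" else int(v) for v in m.groups())
            return soft, hard
    raise ValueError("no 'Max open files' line found")


def count_open_fds(pid):
    """Count all descriptors held by a process (Linux only, needs permission)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_headroom(open_fds, soft_limit):
    """Descriptors left before the soft limit is hit; None if unlimited."""
    if soft_limit is None:
        return None
    return soft_limit - open_fds


if __name__ == "__main__":
    pid = os.getpid()  # replace with the PID of crmd on the affected node
    with open(f"/proc/{pid}/limits") as f:
        soft, hard = parse_open_files_limit(f.read())
    used = count_open_fds(pid)
    print(f"pid {pid}: {used} fds open, soft limit {soft}, headroom {fd_headroom(used, soft)}")
```

With the numbers above (1019 sockets against a soft limit of 1024), the headroom is at most 5 descriptors, which matches the "Too many open files" errors.<br>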

I almost fainted when I saw this one :)<br>

<br>

crm(live)# status<br>

============<br>

Last updated: Fri Dec  7 06:38:48 2012<br>

Stack: openais<br>

Current DC: filer-1 - partition with quorum<br>

Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd<br>

2 Nodes configured, 2 expected votes<br>

11 Resources configured.<br>

============<br>

<br>

OFFLINE: [ filer-2 filer-1 ]<br>

<br>

As far as I know, killall -9 crmd will release the used FDs. Does anyone have any idea how this will work out? I tested killing crmd on another cluster (without this problem) and all resources were migrated to the second node. What can happen in this case, where cluster communication is broken? Has anyone dealt with a similar problem? Resources are currently running on filer-1, the node that had been MASTER before this problem occurred.<br>


<br>

Packages:<br>

<br>

pacemaker - Version: 1.0.8+hg15494-2ubuntu2<br>

corosync - Version: 1.2.0-0ubuntu1<br>

cluster-glue - Version: 1.0.5-1<br>

libcorosync4 - Version: 1.2.0-0ubuntu1<br>

libheartbeat2 - Version: 1:3.0.3-1ubuntu1<br>

<br>

Any help/advice would be really appreciated :)<span class="HOEnZb"><font color="#888888"><br>

-- <br>

--<br>

Piotr Jewiec<br>

<br>

_______________________________________________<br>

Pacemaker mailing list: <a href="mailto:Pacemaker@oss.clusterlabs.org" target="_blank">Pacemaker@oss.clusterlabs.org</a><br>

<a href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker" target="_blank">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a><br>

<br>

Project Home: <a href="http://www.clusterlabs.org" target="_blank">http://www.clusterlabs.org</a><br>

Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>

Bugs: <a href="http://bugs.clusterlabs.org" target="_blank">http://bugs.clusterlabs.org</a><br>

</font></span></blockquote></div><br><br clear="all"><br>-- <br>this is my life, and I live it for as long as God wills<br>