<div dir="ltr">Here is the corosync.conf:<div><br></div><div>------------------------------------</div><div><div>compatibility: whitetank</div><div>totem {</div><div> version: 2</div><div> token: 10000</div><div> token_retransmits_before_loss_const: 10</div>
<div> secauth: off</div><div> threads: 0</div><div> interface {</div><div> ringnumber: 0</div><div> member: {</div><div> memberaddr: 10.0.0.1</div><div> }</div>
<div> member: {</div><div> memberaddr: 10.0.0.2</div><div> }</div><div> bindnetaddr: 10.0.0.1</div><div> mcastport: 5405</div><div> ttl: 1</div><div>
}</div><div> transport: udpu</div><div>}</div><div>logging {</div><div> fileline: off</div><div> to_stderr: no</div><div> to_logfile: yes</div><div> to_syslog: yes</div><div> syslog_facility: local6</div>
<div> syslog_priority: debug</div><div> debug:on</div><div> logfile: /var/log/cluster/corosync.log</div><div> timestamp: on</div><div> logger_subsys {</div><div> subsys: AMF</div><div> debug: off</div>
<div> }</div><div>}</div><div>amf {</div><div> mode: disabled</div><div>}</div><div>service{</div><div> ver:1</div><div> name:pacemaker</div><div>}</div><div>aisexec{</div><div> user:root</div><div> group:root</div>
<div>}</div></div><div><br></div><div>-----------------------------------</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">2014-07-18 10:35 GMT+08:00 Emre He <span dir="ltr"><<a href="mailto:emre.he@gmail.com" target="_blank">emre.he@gmail.com</a>></span>:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi, <div><br></div><div>I am running a classic corosync+pacemaker Linux-HA cluster (2 servers). After I rebooted one server and it came back up, corosync was running but pacemaker was dead. </div>
<div><br></div><div>corosync.log shows the following: </div>
<div>--------------------------------------------------------</div><div><div>Jul 17 03:56:04 [2068] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> crmd: info: crmd_exit: <span style="white-space:pre-wrap"> </span>Dropping I_TERMINATE: [ state=S_STOPPING cause=C_FSA_INTERNAL origin=do_stop ]</div>
<div>Jul 17 03:56:04 [2068] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> crmd: debug: lrm_state_verify_stopped: <span style="white-space:pre-wrap"> </span>Checking for active resources before exit</div>
<div>Jul 17 03:56:04 [2068] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> crmd: info: crmd_cs_destroy: <span style="white-space:pre-wrap"> </span>connection closed</div>
<div>Jul 17 03:56:04 [2068] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> crmd: info: crmd_init: <span style="white-space:pre-wrap"> </span>Inhibiting automated respawn</div><div><b>Jul 17 03:56:04 [2068] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> crmd: info: crmd_init: <span style="white-space:pre-wrap"> </span>2068 stopped: Network is down (100)</b></div>
<div><b>Jul 17 03:56:04 [2068] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> crmd: warning: crmd_fast_exit: <span style="white-space:pre-wrap"> </span>Inhibiting respawn: 100 -> 100</b></div><div>
Jul 17 03:56:04 [2068] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> crmd: info: crm_xml_cleanup: <span style="white-space:pre-wrap"> </span>Cleaning up memory from libxml2</div>
<div>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> pacemakerd: debug: qb_ipcs_dispatch_connection_request: <span style="white-space:pre-wrap"> </span>HUP conn (2057-2068-14)</div>
<div>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> pacemakerd: debug: qb_ipcs_disconnect: <span style="white-space:pre-wrap"> </span>qb_ipcs_disconnect(2057-2068-14) state:2</div>
<div>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> pacemakerd: info: crm_client_destroy: <span style="white-space:pre-wrap"> </span>Destroying 0 events</div><div>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> pacemakerd: debug: qb_rb_close: <span style="white-space:pre-wrap"> </span>Free'ing ringbuffer: /dev/shm/qb-pacemakerd-response-2057-2068-14-header</div>
<div>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> pacemakerd: debug: qb_rb_close: <span style="white-space:pre-wrap"> </span>Free'ing ringbuffer: /dev/shm/qb-pacemakerd-event-2057-2068-14-header</div>
<div>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> pacemakerd: debug: qb_rb_close: <span style="white-space:pre-wrap"> </span>Free'ing ringbuffer: /dev/shm/qb-pacemakerd-request-2057-2068-14-header</div>
<div>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> pacemakerd: error: pcmk_child_exit: <span style="white-space:pre-wrap"> </span>Child process crmd (2068) exited: Network is down (100)</div>
<div>
Jul 17 03:56:04 [2057] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> pacemakerd: warning: pcmk_child_exit: <span style="white-space:pre-wrap"> </span>Pacemaker child process crmd no longer wishes to be respawned. Shutting ourselves down.</div>
<div>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> pacemakerd: debug: update_node_processes: <span style="white-space:pre-wrap"> </span>Node <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> now has process list: 00000000000000000000000000111112 (was 00000000000000000000000000111312)</div>
<div><b>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> pacemakerd: notice: pcmk_shutdown_worker: <span style="white-space:pre-wrap"> </span>Shuting down Pacemaker</b></div><div><b>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> pacemakerd: debug: pcmk_shutdown_worker: <span style="white-space:pre-wrap"> </span>crmd confirmed stopped</b></div>
<div>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> pacemakerd: notice: stop_child: <span style="white-space:pre-wrap"> </span>Stopping pengine: Sent -15 to process 2067</div><div>Jul 17 03:56:04 [2067] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> pengine: info: crm_signal_dispatch: <span style="white-space:pre-wrap"> </span>Invoking handler for signal 15: Terminated</div>
<div>Jul 17 03:56:04 [2067] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> pengine: info: qb_ipcs_us_withdraw: <span style="white-space:pre-wrap"> </span>withdrawing server sockets</div></div><div><br>
</div><div>
<br></div><div><div>Jul 17 03:56:04 [2063] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> cib: debug: qb_ipcs_unref: <span style="white-space:pre-wrap"> </span>qb_ipcs_unref() - destroying</div><div>
Jul 17 03:56:04 [2063] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> cib: info: crm_xml_cleanup: <span style="white-space:pre-wrap"> </span>Cleaning up memory from libxml2</div>
<div>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> pacemakerd: info: pcmk_child_exit: <span style="white-space:pre-wrap"> </span>Child process cib (2063) exited: OK (0)</div><div>
Jul 17 03:56:04 [2057] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> pacemakerd: debug: update_node_processes: <span style="white-space:pre-wrap"> </span>Node <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> now has process list: 00000000000000000000000000000002 (was 00000000000000000000000000000102)</div>
<div>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> pacemakerd: warning: qb_ipcs_event_sendv: <span style="white-space:pre-wrap"> </span>new_event_notification (2057-2063-13): Broken pipe (32)</div>
<div><b>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> pacemakerd: debug: pcmk_shutdown_worker: <span style="white-space:pre-wrap"> </span>cib confirmed stopped</b></div><div><b>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> pacemakerd: notice: pcmk_shutdown_worker: <span style="white-space:pre-wrap"> </span>Shutdown complete</b></div>
<div>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> pacemakerd: notice: pcmk_shutdown_worker: <span style="white-space:pre-wrap"> </span>Attempting to inhibit respawning after fatal error</div>
<div>
Jul 17 03:56:04 [2057] <a href="http://foo.bar.com" target="_blank">foo.bar.com</a> pacemakerd: info: crm_xml_cleanup: <span style="white-space:pre-wrap"> </span>Cleaning up memory from libxml2</div><div>Jul 17 03:56:04 corosync [CPG ] exit_fn for conn=0x17e3a20</div>
<div>Jul 17 03:56:04 corosync [pcmk ] WARN: route_ais_message: Sending message to local.stonith-ng failed: ipc delivery failed (rc=-2)</div><div>Jul 17 03:56:04 corosync [CPG ] got procleave message from cluster node 433183754</div>
<div>Jul 17 03:56:07 corosync [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)</div><div><b>Jul 17 03:56:19 corosync [pcmk ] WARN: route_ais_message: Sending message to local.stonith-ng failed: ipc delivery failed (rc=-2)</b></div>
<div><b>Jul 17 03:56:19 corosync [pcmk ] WARN: route_ais_message: Sending message to local.stonith-ng failed: ipc delivery failed (rc=-2)</b></div></div><div>--------------------------------------------------------<br></div>
<div><br></div><div>Here are my HA cluster parameters and package versions:</div><div>--------------------------------------------------------<br></div><div><div>property cib-bootstrap-options: \</div><div> dc-version=1.1.10-1.el6_4.4-368c726 \</div>
<div> cluster-infrastructure="classic openais (with plugin)" \</div><div> expected-quorum-votes=2 \</div><div> stonith-enabled=false \</div><div> no-quorum-policy=ignore \</div><div>
start-failure-is-fatal=false \</div>
<div> default-action-timeout=300s</div><div>rsc_defaults rsc-options: \</div><div> resource-stickiness=100</div></div><div><br></div><div><div><br></div><div>pacemaker-1.1.10-1.el6_4.4.x86_64</div><div>corosync-1.4.1-15.el6_4.1.x86_64</div>
<div><br></div></div><div>--------------------------------------------------------</div><div><br></div><div>I am not sure whether the network had a brief disconnection (both servers are VMware VMs), but the logs seem to suggest it. </div><div>
So is an unexpected network issue the root cause? As I understand it, that is exactly what HA should handle. </div><div>Or are there any other clues about the root cause? </div><div><br></div><div>Many thanks, </div><span class="HOEnZb"><font color="#888888"><div>
Emre</div>
</font></span></div>
</blockquote></div><br></div>