[Pacemaker] Pacemaker is down after server reboot; corosync.log shows "Network is down (100)", "Shuting down Pacemaker"

Emre He emre.he at gmail.com
Thu Jul 17 22:35:10 EDT 2014


Hi,

I am working on a classic corosync+pacemaker Linux-HA cluster (2 servers).
After rebooting one server, when it comes back up, corosync is running but
pacemaker is dead.
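
Status on the rebooted node can be confirmed with the standard RHEL 6 tools
(a quick sketch; init script names may differ on other distributions):
--------------------------------------------------------
# corosync ring status (prints "ring 0 active with no faults" when healthy)
corosync-cfgtool -s
# pacemaker daemon status (reported as stopped on the rebooted node)
service pacemaker status
--------------------------------------------------------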

In corosync.log, we can see the following:
--------------------------------------------------------
Jul 17 03:56:04 [2068] foo.bar.com       crmd:     info: crmd_exit: Dropping I_TERMINATE: [ state=S_STOPPING cause=C_FSA_INTERNAL origin=do_stop ]
Jul 17 03:56:04 [2068] foo.bar.com       crmd:    debug: lrm_state_verify_stopped: Checking for active resources before exit
Jul 17 03:56:04 [2068] foo.bar.com       crmd:     info: crmd_cs_destroy: connection closed
Jul 17 03:56:04 [2068] foo.bar.com       crmd:     info: crmd_init: Inhibiting automated respawn
Jul 17 03:56:04 [2068] foo.bar.com       crmd:     info: crmd_init: 2068 stopped: Network is down (100)
Jul 17 03:56:04 [2068] foo.bar.com       crmd:  warning: crmd_fast_exit: Inhibiting respawn: 100 -> 100
Jul 17 03:56:04 [2068] foo.bar.com       crmd:     info: crm_xml_cleanup: Cleaning up memory from libxml2
Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:    debug: qb_ipcs_dispatch_connection_request: HUP conn (2057-2068-14)
Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:    debug: qb_ipcs_disconnect: qb_ipcs_disconnect(2057-2068-14) state:2
Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:     info: crm_client_destroy: Destroying 0 events
Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:    debug: qb_rb_close: Free'ing ringbuffer: /dev/shm/qb-pacemakerd-response-2057-2068-14-header
Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:    debug: qb_rb_close: Free'ing ringbuffer: /dev/shm/qb-pacemakerd-event-2057-2068-14-header
Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:    debug: qb_rb_close: Free'ing ringbuffer: /dev/shm/qb-pacemakerd-request-2057-2068-14-header
Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:    error: pcmk_child_exit: Child process crmd (2068) exited: Network is down (100)
Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:  warning: pcmk_child_exit: Pacemaker child process crmd no longer wishes to be respawned. Shutting ourselves down.
Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:    debug: update_node_processes: Node foo.bar.com now has process list: 00000000000000000000000000111112 (was 00000000000000000000000000111312)
Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:   notice: pcmk_shutdown_worker: Shuting down Pacemaker
Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:    debug: pcmk_shutdown_worker: crmd confirmed stopped
Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:   notice: stop_child: Stopping pengine: Sent -15 to process 2067
Jul 17 03:56:04 [2067] foo.bar.com    pengine:     info: crm_signal_dispatch: Invoking handler for signal 15: Terminated
Jul 17 03:56:04 [2067] foo.bar.com    pengine:     info: qb_ipcs_us_withdraw: withdrawing server sockets

Jul 17 03:56:04 [2063] foo.bar.com        cib:    debug: qb_ipcs_unref: qb_ipcs_unref() - destroying
Jul 17 03:56:04 [2063] foo.bar.com        cib:     info: crm_xml_cleanup: Cleaning up memory from libxml2
Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:     info: pcmk_child_exit: Child process cib (2063) exited: OK (0)
Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:    debug: update_node_processes: Node foo.bar.com now has process list: 00000000000000000000000000000002 (was 00000000000000000000000000000102)
Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:  warning: qb_ipcs_event_sendv: new_event_notification (2057-2063-13): Broken pipe (32)
Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:    debug: pcmk_shutdown_worker: cib confirmed stopped
Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:   notice: pcmk_shutdown_worker: Shutdown complete
Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:   notice: pcmk_shutdown_worker: Attempting to inhibit respawning after fatal error
Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:     info: crm_xml_cleanup: Cleaning up memory from libxml2
Jul 17 03:56:04 corosync [CPG   ] exit_fn for conn=0x17e3a20
Jul 17 03:56:04 corosync [pcmk  ] WARN: route_ais_message: Sending message to local.stonith-ng failed: ipc delivery failed (rc=-2)
Jul 17 03:56:04 corosync [CPG   ] got procleave message from cluster node 433183754
Jul 17 03:56:07 corosync [pcmk  ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
Jul 17 03:56:19 corosync [pcmk  ] WARN: route_ais_message: Sending message to local.stonith-ng failed: ipc delivery failed (rc=-2)
Jul 17 03:56:19 corosync [pcmk  ] WARN: route_ais_message: Sending message to local.stonith-ng failed: ipc delivery failed (rc=-2)
--------------------------------------------------------
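
For what it's worth, exit code 100 looks like plain errno ENETDOWN ("Network
is down"), which crmd passed through as a fatal exit code, so pacemakerd
inhibited the respawn and shut itself down. If your pacemaker-cli build ships
crm_error, the mapping can be checked directly (a sketch, assuming the tool
is installed):
--------------------------------------------------------
# decode the code crmd exited with (expected output: "Network is down")
crm_error 100
--------------------------------------------------------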

Here are my HA cluster parameters and package versions:
--------------------------------------------------------
property cib-bootstrap-options: \
        dc-version=1.1.10-1.el6_4.4-368c726 \
        cluster-infrastructure="classic openais (with plugin)" \
        expected-quorum-votes=2 \
        stonith-enabled=false \
        no-quorum-policy=ignore \
        start-failure-is-fatal=false \
        default-action-timeout=300s
rsc_defaults rsc-options: \
        resource-stickiness=100


pacemaker-1.1.10-1.el6_4.4.x86_64
corosync-1.4.1-15.el6_4.1.x86_64

--------------------------------------------------------
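Since cluster-infrastructure is "classic openais (with plugin)", corosync
loads the pacemaker plugin and pacemakerd runs as a separate process started
by the pacemaker init script. For reference, the usual plugin stanza on this
stack looks like the following (our file may differ slightly):
--------------------------------------------------------
# /etc/corosync/service.d/pcmk (typical contents on the plugin stack)
service {
        name: pacemaker
        ver:  1
}
--------------------------------------------------------
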

I am not sure whether the network had a brief disconnection (both servers are
VMware VMs), but that is what the logs seem to show. So is a transient network
issue the root cause of this unexpected shutdown? My understanding is that
this is exactly the kind of failure an HA stack should handle, rather than
shut itself down. Or is there any other clue to the root cause?
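
In case it matters, this is how I intend to check whether the NIC was still
down when pacemaker started at boot (paths per RHEL 6; adjust for your
logging setup):
--------------------------------------------------------
# look for NIC link up/down transitions around the 03:56 boot
grep -iE "link is (up|down)" /var/log/messages
# compare against when crmd first logged after the reboot
grep -m1 "crmd" /var/log/cluster/corosync.log
--------------------------------------------------------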

many thanks,
Emre

