[Pacemaker] crmd failure

Andrew Beekhof andrew at beekhof.net
Wed Oct 22 04:04:48 UTC 2014


> On 18 Oct 2014, at 1:52 am, John Osborne <john.osborne at arrisi.com> wrote:
> 
> I have a two node cluster which manages 4 resources in a resource group.
> Node 1 was active and was rebooted. Resources started on the second node. At
> the exact time the first node completed rebooting, crmd failed on the second
> node. Logs below. These nodes are running the pacemaker-1.1.10-0.15.25
> rpm.
> 
> Any ideas on how to determine what happened here? Problem with crmd?
> 
> Oct 15 04:46:46 vho-1-mc2 crmd[12132]:    error: crmd_node_update_complete:
> Node update 51 failed: Timer expired (-62)

We tried to update the cib but the update took too long and timed out (hence the "Timer expired" error below).
What else was the cluster doing at the time?

Also, the cib in 1.1.12 is two orders of magnitude faster, so it might be worth an upgrade.
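
If you want to dig further, a rough first check would be how big the cib had grown and what the node was busy with around 04:46. Something like the following (illustrative only, assuming the standard pacemaker CLI tools are installed on the node; adjust the timestamps and destination path to suit):

    # How big is the CIB, and how much of it is the status section
    # (usually the bulk of it)?
    cibadmin -Q | wc -c
    cibadmin -Q -o status | wc -c

    # Collect logs and cluster state covering the failure window
    # (window and destination path are just examples)
    crm_report -f "2014-10-15 04:40:00" -t "2014-10-15 04:50:00" /tmp/crmd-failure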

> Oct 15 04:46:46 vho-1-mc2 crmd[12132]:    error: do_log: FSA: Input I_ERROR
> from crmd_node_update_complete() received in state S_IDLE
> Oct 15 04:46:46 vho-1-mc2 crmd[12132]:   notice: do_state_transition: State
> transition S_IDLE -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL
> origin=crmd_node_update_complete ]
> Oct 15 04:46:46 vho-1-mc2 crmd[12132]:  warning: do_recover: Fast-tracking
> shutdown in response to errors
> Oct 15 04:46:46 vho-1-mc2 crmd[12132]:  warning: do_election_vote: Not
> voting in election, we're in state S_RECOVERY
> Oct 15 04:46:46 vho-1-mc2 crmd[12132]:    error: do_log: FSA: Input
> I_TERMINATE from do_recover() received in state S_RECOVERY
> Oct 15 04:46:46 vho-1-mc2 crmd[12132]:   notice: lrm_state_verify_stopped:
> Stopped 0 recurring operations at shutdown (5 ops remaining)
> Oct 15 04:46:46 vho-1-mc2 crmd[12132]:   notice: lrm_state_verify_stopped:
> Recurring action cdssRA:17 (cdssRA_monitor_15000) incomplete at shutdown
> Oct 15 04:46:46 vho-1-mc2 crmd[12132]:   notice: lrm_state_verify_stopped:
> Recurring action mcast_IP:22 (mcast_IP_monitor_5000) incomplete at shutdown
> Oct 15 04:46:46 vho-1-mc2 crmd[12132]:   notice: lrm_state_verify_stopped:
> Recurring action mgmt_IP:27 (mgmt_IP_monitor_5000) incomplete at shutdown
> Oct 15 04:46:46 vho-1-mc2 crmd[12132]:   notice: lrm_state_verify_stopped:
> Recurring action cdssDB:12 (cdssDB_monitor_30000) incomplete at shutdown
> Oct 15 04:46:46 vho-1-mc2 crmd[12132]:   notice: lrm_state_verify_stopped:
> Recurring action mcast-route:32 (mcast-route_monitor_10000) incomplete at
> shutdown
> Oct 15 04:46:46 vho-1-mc2 crmd[12132]:    error: lrm_state_verify_stopped: 6
> resources were active at shutdown.
> Oct 15 04:46:46 vho-1-mc2 crmd[12132]:   notice: do_lrm_control:
> Disconnected from the LRM
> Oct 15 04:46:46 vho-1-mc2 crmd[12132]:   notice: terminate_cs_connection:
> Disconnecting from Corosync
> Oct 15 04:46:46 vho-1-mc2 corosync[12120]:  [pcmk  ] info: pcmk_ipc_exit:
> Client crmd (conn=0x65e6d0, async-conn=0x65e6d0) left
> Oct 15 04:46:46 vho-1-mc2 crmd[12132]:    error: crmd_fast_exit: Could not
> recover from internal error
> Oct 15 04:46:47 vho-1-mc2 corosync[12120]:  [pcmk  ] ERROR:
> pcmk_wait_dispatch: Child process crmd exited (pid=12132, rc=201)
> Oct 15 04:46:47 vho-1-mc2 corosync[12120]:  [pcmk  ] info: update_member:
> Node vho-1-mc2 now has process list: 00000000000000000000000000151112 (1380626)
> Oct 15 04:46:47 vho-1-mc2 corosync[12120]:  [pcmk  ] notice:
> pcmk_wait_dispatch: Respawning failed child process: crmd
> 
> 
> 
> 




