[Pacemaker] An internal error occurred in crmd

Wed Oct 30 20:13:55 EDT 2013

I think this should be fixed by:
   https://github.com/beekhof/pacemaker/commit/ea7991f

The underlying issue, though, is that the lrmd command timed out, which _should_ have been fixed by:
   https://github.com/beekhof/pacemaker/commit/d65b270
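
If it helps, here is a rough way to check whether the tree you built from already contains both commits. This is only a sketch: it assumes you run it inside a git clone of pacemaker that has fetched the history containing them, and has_commit is just a helper name made up for this example.

```shell
# Succeeds if the given commit is an ancestor of the checked-out HEAD.
# Assumption: the current directory is a git clone that already has the
# relevant history fetched. has_commit is a hypothetical helper name.
has_commit() {
    git merge-base --is-ancestor "$1" HEAD 2>/dev/null
}

for c in ea7991f d65b270; do
    if has_commit "$c"; then
        echo "$c: included"
    else
        # Also reached when the commit is unknown to this clone.
        echo "$c: missing (or not fetched)"
    fi
done
```

A "missing" result can also mean the commit simply has not been fetched into your clone yet, so fetch first before drawing conclusions.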

What are you doing to this poor cluster? :)
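
For sifting through error output like the excerpt quoted below, something along these lines may help. It is a minimal sketch that assumes the unwrapped syslog layout seen in that excerpt (month, day, time, host, daemon[pid]:, "error:", function:); summarize_errors is a name invented for this example, not a Pacemaker tool.

```shell
# Count "error:" log lines per host, daemon and reporting function.
# Assumption: each log entry is on one line, laid out as
#   Mon DD HH:MM:SS host daemon[pid]: error: function: message
# summarize_errors is a hypothetical helper name.
summarize_errors() {
    awk '$6 == "error:" {
             sub(/\[[0-9]+\]:$/, "", $5)   # strip "[pid]:" from the daemon
             sub(/:$/, "", $7)             # strip trailing ":" from the function
             n[$4 " " $5 " " $7]++
         }
         END { for (k in n) print n[k], k }' | sort -k2
}

# Possible usage (bash, reusing the loop from the original report):
#   for i in n{6..8}; do ssh "$i" 'grep error: /var/log/ha-log'; done | summarize_errors
```

Grouping by function makes it easier to see that the lrmd_send_command timeouts come first on a node and the crmd exit follows from them.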

On 21 Oct 2013, at 3:59 pm, Kazunori INOUE <kazunori.inoue3 at gmail.com> wrote:

> Hi,
> 
> I'm using pacemaker-1.1 (b6d42ed, the latest devel).
> 
> After starting corosync and pacemaker on three nodes,
> I loaded the configuration.
> An internal error then occurred in crmd, and it exited.
> 
> $ crm configure load update 3vm+2stonith.cli
> $ for i in n{6..8};do ssh $i 'grep error: /var/log/ha-log';done
> Oct 21 11:19:43 bl460g1n6 pengine[7684]:    error: unpack_resources:
> Resource start-up disabled since no STONITH resources have been
> defined
> Oct 21 11:19:43 bl460g1n6 pengine[7684]:    error: unpack_resources:
> Either configure some or disable STONITH with the stonith-enabled
> option
> Oct 21 11:19:43 bl460g1n6 pengine[7684]:    error: unpack_resources:
> NOTE: Clusters with shared data need STONITH to ensure data integrity
> Oct 21 11:20:51 bl460g1n6 crmd[7685]:    error: crm_element_value:
> Couldn't find lrmd_callid in NULL
> Oct 21 11:20:51 bl460g1n6 crmd[7685]:    error: crm_abort:
> crm_element_value: Triggered assert at xml.c:3336 : data != NULL
> Oct 21 11:20:51 bl460g1n6 crmd[7685]:    error: crm_element_value:
> Couldn't find lrmd_rc in NULL
> Oct 21 11:20:51 bl460g1n6 crmd[7685]:    error: crm_abort:
> crm_element_value: Triggered assert at xml.c:3336 : data != NULL
> Oct 21 11:20:53 bl460g1n6 crmd[7685]:    error:
> internal_ipc_get_reply: Discarding old reply 90 (need 91)
> 
> Oct 21 11:20:51 bl460g1n7 crmd[12487]:    error: lrmd_send_command:
> Couldn't perform lrmd_rsc_info operation (timeout=30000): -11:
> Connection timed out (110)
> Oct 21 11:20:52 bl460g1n7 crmd[12487]:    error: lrmd_send_command:
> Couldn't perform lrmd_rsc_register operation (timeout=0): -114:
> Connection timed out (110)
> Oct 21 11:20:52 bl460g1n7 crmd[12487]:    error: lrmd_send_command:
> Couldn't perform lrmd_rsc_info operation (timeout=30000): -114:
> Connection timed out (110)
> Oct 21 11:20:52 bl460g1n7 crmd[12487]:    error: get_lrm_resource:
> Could not add resource prmStonith6-2 to LRM
> Oct 21 11:20:52 bl460g1n7 crmd[12487]:    error: do_lrm_invoke:
> Invalid resource definition
> Oct 21 11:20:52 bl460g1n7 crmd[12487]:    error: do_log: FSA: Input
> I_TERMINATE from do_recover() received in state S_RECOVERY
> Oct 21 11:20:52 bl460g1n7 crmd[12487]:    error:
> lrm_state_verify_stopped: 4 pending LRM operations at shutdown
> Oct 21 11:20:52 bl460g1n7 crmd[12487]:    error:
> lrm_state_verify_stopped: Pending action: prmVM3:13 (prmVM3_monitor_0)
> Oct 21 11:20:52 bl460g1n7 crmd[12487]:    error:
> lrm_state_verify_stopped: Pending action: prmVM2:9 (prmVM2_monitor_0)
> Oct 21 11:20:52 bl460g1n7 crmd[12487]:    error:
> lrm_state_verify_stopped: Pending action: prmVM1:5 (prmVM1_monitor_0)
> Oct 21 11:20:52 bl460g1n7 crmd[12487]:    error:
> lrm_state_verify_stopped: Pending action: prmStonith6-1:17
> (prmStonith6-1_monitor_0)
> Oct 21 11:20:52 bl460g1n7 crmd[12487]:    error: crmd_fast_exit: Could
> not recover from internal error
> Oct 21 11:20:52 bl460g1n7 pacemakerd[12477]:    error:
> pcmk_child_exit: Child process crmd (12487) exited: Generic Pacemaker
> error (201)
> 
> Oct 21 11:20:51 bl460g1n8 crmd[1600]:    error: lrmd_send_command:
> Couldn't perform lrmd_rsc_info operation (timeout=30000): -11:
> Connection timed out (110)
> Oct 21 11:20:52 bl460g1n8 crmd[1600]:    error: lrmd_send_command:
> Couldn't perform lrmd_rsc_register operation (timeout=0): -114:
> Connection timed out (110)
> Oct 21 11:20:52 bl460g1n8 crmd[1600]:    error: lrmd_send_command:
> Couldn't perform lrmd_rsc_info operation (timeout=30000): -114:
> Connection timed out (110)
> Oct 21 11:20:52 bl460g1n8 crmd[1600]:    error: get_lrm_resource:
> Could not add resource prmStonith6-2 to LRM
> Oct 21 11:20:52 bl460g1n8 crmd[1600]:    error: do_lrm_invoke: Invalid
> resource definition
> Oct 21 11:20:52 bl460g1n8 crmd[1600]:    error: do_log: FSA: Input
> I_TERMINATE from do_recover() received in state S_RECOVERY
> Oct 21 11:20:52 bl460g1n8 crmd[1600]:    error:
> lrm_state_verify_stopped: 4 pending LRM operations at shutdown
> Oct 21 11:20:52 bl460g1n8 crmd[1600]:    error:
> lrm_state_verify_stopped: Pending action: prmVM3:13 (prmVM3_monitor_0)
> Oct 21 11:20:52 bl460g1n8 crmd[1600]:    error:
> lrm_state_verify_stopped: Pending action: prmVM2:9 (prmVM2_monitor_0)
> Oct 21 11:20:52 bl460g1n8 crmd[1600]:    error:
> lrm_state_verify_stopped: Pending action: prmVM1:5 (prmVM1_monitor_0)
> Oct 21 11:20:52 bl460g1n8 crmd[1600]:    error:
> lrm_state_verify_stopped: Pending action: prmStonith6-1:17
> (prmStonith6-1_monitor_0)
> Oct 21 11:20:52 bl460g1n8 crmd[1600]:    error: crmd_fast_exit: Could
> not recover from internal error
> Oct 21 11:20:52 bl460g1n8 pacemakerd[1591]:    error: pcmk_child_exit:
> Child process crmd (1600) exited: Generic Pacemaker error (201)
> 
> Best Regards,
> Kazunori INOUE
> <crmd_internal_error.tar.bz2>