[Pacemaker] crmd killed by signal 11 (pacemaker1.0.5/heartbeat3/RHEL5.3)

Andrew Beekhof andrew at beekhof.net
Sun Oct 11 19:26:47 UTC 2009


On Fri, Oct 9, 2009 at 7:02 AM, Li, Ling (Ling) <lli1 at alcatel-lucent.com> wrote:
> Hi,
>
> The RPMs were downloaded from
> http://download.opensuse.org/repositories/server:/haclustering/RHEL_5/x86_64
>
> We have a cluster of two nodes with 17 resources configured.
> Of those 17 resources, 4 are clones, 1 is a group, and 12 are primitives.
> Cluster options are
>   symmetric-cluster=true
>   stonith-enabled=false
> The rest are default.
>
>
> xmllint -relaxng /usr/share/pacemaker/pacemaker.rng cib.xml
> returns
> cib.xml validates
>
> The cluster works fine except that crmd is killed by signal 11 sporadically.
> So far I have the following four causes. The first one is the most common one.
>
> 1. Core was generated by `/usr/lib64/heartbeat/crmd'.
> Program terminated with signal 11, Segmentation fault.
> [New process 2543]
> #0  0x0000000000428d7f in te_graph_trigger ()

I'd have expected gdb to indicate a line number here.
Hard to know what the problem might be... do you have the logs for this crash?
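
As a rough aid for triaging backtraces like the one quoted here, the sketch below lists the frames that gdb printed without an "at file:line" suffix. A missing suffix usually means the binary was built without debug information or was stripped, though that is only the common cause. The helper names (split_frames, frames_without_source) and the SAMPLE text are illustrative assumptions, not part of Pacemaker or gdb.

```python
import re

# "#N  0xADDR in function (" at the start of a gdb frame; the address is optional.
FRAME_START = re.compile(r"^#(\d+)\s+(?:0x[0-9a-f]+ in )?(\S+) \(")
# ") at file.c:123" at the end of a frame that has source information.
SOURCE_LOCATION = re.compile(r"\) at (\S+):(\d+)$")

# First three frames of backtrace 1, copied from the quoted mail.
SAMPLE = """\
> #0  0x0000000000428d7f in te_graph_trigger ()
> #1  0x00002ab6d4a3df63 in crm_trigger_dispatch (source=0xda327a0, callback=0x428d34 <te_graph_trigger>,
>    userdata=0xda327a0) at mainloop.c:53
> #2  0x0000003e8dc2cdb4 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
"""


def split_frames(backtrace):
    """Join wrapped gdb output into one string per frame, dropping mail quoting."""
    frames = []
    for raw in backtrace.splitlines():
        line = raw.lstrip("> ").strip()
        if not line or line.startswith("(gdb)"):
            continue
        if line.startswith("#"):
            frames.append(line)
        elif frames:
            # Continuation of the previous frame's argument list.
            frames[-1] += " " + line
    return frames


def frames_without_source(backtrace):
    """Return the function names of frames that carry no 'at file:line' suffix."""
    missing = []
    for frame in split_frames(backtrace):
        match = FRAME_START.match(frame)
        if match and not SOURCE_LOCATION.search(frame):
            missing.append(match.group(2))
    return missing


if __name__ == "__main__":
    for name in frames_without_source(SAMPLE):
        print("no source location:", name)
```

Frames reported this way are the ones where matching debug symbols would be needed before gdb can show a line number.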

> (gdb) bt
> #0  0x0000000000428d7f in te_graph_trigger ()
> #1  0x00002ab6d4a3df63 in crm_trigger_dispatch (source=0xda327a0, callback=0x428d34 <te_graph_trigger>,
>    userdata=0xda327a0) at mainloop.c:53
> #2  0x0000003e8dc2cdb4 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> #3  0x0000003e8dc2fc0d in ?? () from /lib64/libglib-2.0.so.0
> #4  0x0000003e8dc2ff1a in g_main_loop_run () from /lib64/libglib-2.0.so.0
> #5  0x0000000000405020 in crmd_init ()
> #6  0x0000000000404efb in main ()
> ---------
> 2. Program terminated with signal 11, Segmentation fault.

This one is (now) fixed in
   http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/2a324fe868c1

Any reason you're running with such a high debug level?

> [New process 567]
> #0  0x0000003e8bc7b2f0 in strcasecmp () from /lib64/libc.so.6
> (gdb) bt
> #0  0x0000003e8bc7b2f0 in strcasecmp () from /lib64/libc.so.6
> #1  0x00002b71d9ac0bf7 in crm_str_eq (a=0x2b71d9ad85fb "__name__",
>    b=0x22296c6c <Address 0x22296c6c out of bounds>, use_case=0) at utils.c:1848
> #2  0x00002b71d9ac60e4 in log_data_element (function=0x43444f "do_lrm_query",
>    prefix=0x4344a1 "Current state of the LRM", log_level=9, depth=0, data=0xa96b170, formatted=1)
>    at xml.c:1175
> #3  0x00002b71d9ac4a44 in print_xml_formatted (log_level=9, function=0x43444f "do_lrm_query",
>    msg=0xa96b170, text=0x4344a1 "Current state of the LRM") at xml.c:775
> #4  0x000000000041b773 in do_lrm_query ()
> #5  0x0000000000413ebc in do_cl_join_finalize_respond ()
> #6  0x0000000000405ad8 in do_fsa_action ()
> #7  0x0000000000406620 in s_crmd_fsa_actions ()
> #8  0x0000000000405f9c in s_crmd_fsa ()
> #9  0x0000000000411b35 in crm_fsa_trigger ()
> #10 0x00002b71d9ad4f63 in crm_trigger_dispatch (source=0xa95dd70, callback=0x411acf <crm_fsa_trigger>,
>    userdata=0xa95dd70) at mainloop.c:53
> #11 0x0000003e8dc2cdb4 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> #12 0x0000003e8dc2fc0d in ?? () from /lib64/libglib-2.0.so.0
> #13 0x0000003e8dc2ff1a in g_main_loop_run () from /lib64/libglib-2.0.so.0
> #14 0x0000000000405020 in crmd_init ()
> #15 0x0000000000404efb in main ()
> -------
> 3. Core was generated by `/usr/lib64/heartbeat/crmd'.

This isn't a "crash"; Pacemaker has encountered a situation it didn't expect.
When this happens, it saves the program state (by generating a core
file) and exits so that it can be respawned and try again.
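
To make the distinction concrete, here is a minimal sketch of how a supervising process on a POSIX system could tell a deliberate abort (signal 6, as in this trace) from a segmentation fault (signal 11, as in the others). It is not how heartbeat or Pacemaker respawn crmd; classify_signal and run_and_classify are hypothetical names, and a SIGABRT is only commonly, not always, the result of a failed assertion.

```python
import signal
import subprocess
import sys


def classify_signal(signum):
    """Give a rough, human-readable reading of a fatal signal number."""
    if signum == signal.SIGABRT:
        return "deliberate abort (commonly a failed assertion)"
    if signum == signal.SIGSEGV:
        return "segmentation fault (invalid memory access)"
    return "other signal"


def run_and_classify(argv):
    """Run a child process and describe how it ended.

    Returns (signum, description); signum is None if the child exited
    on its own. On POSIX, subprocess reports death by signal N as a
    return code of -N.
    """
    returncode = subprocess.run(argv).returncode
    if returncode < 0:
        signum = -returncode
        return signum, classify_signal(signum)
    return None, "exited with status %d" % returncode


if __name__ == "__main__":
    # Stand-in child that aborts itself, the way a failed assertion would.
    child = [sys.executable, "-c", "import os; os.abort()"]
    print(run_and_classify(child))
```

A supervisor built along these lines would restart the child in either case; the signal number mainly tells you whether to look for an unexpected-state assertion or a memory error in the core file.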

> Program terminated with signal 6, Aborted.
> [New process 4101]
> #0  0x0000003839e30215 in raise () from /lib64/libc.so.6
> (gdb) bt
> #0  0x0000003839e30215 in raise () from /lib64/libc.so.6
> #1  0x0000003839e31cc0 in abort () from /lib64/libc.so.6
> #2  0x00002ba6d6c9238d in crm_abort (file=0x430f89 "election.c",
>    function=0x430ff0 "do_election_count_vote", line=265,
>    assert_condition=0x431138 "crm_str_eq(fsa_our_uuid, election_owner, TRUE)", do_core=1, do_fork=0)
>    at utils.c:1375
> #3  0x0000000000412712 in do_election_count_vote ()
> #4  0x0000000000405ad8 in do_fsa_action ()
> #5  0x00000000004066d4 in s_crmd_fsa_actions ()
> #6  0x0000000000405f9c in s_crmd_fsa ()
> #7  0x0000000000411b35 in crm_fsa_trigger ()
> #8  0x00002ba6d6ca7f63 in crm_trigger_dispatch (source=0x143cbd70, callback=0x411acf <crm_fsa_trigger>,
>    userdata=0x143cbd70) at mainloop.c:53
> #9  0x000000383be2cdb4 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> #10 0x000000383be2fc0d in ?? () from /lib64/libglib-2.0.so.0
> #11 0x000000383be2ff1a in g_main_loop_run () from /lib64/libglib-2.0.so.0
> #12 0x0000000000405020 in crmd_init ()
> #13 0x0000000000404efb in main ()
> ---
> 4. Core was generated by `/usr/lib64/heartbeat/crmd'

This is the same as 2, which is now fixed.

> Program terminated with signal 11, Segmentation fault.
> [New process 6817]
> #0  0x0000003839e7b2f0 in strcasecmp () from /lib64/libc.so.6
> (gdb) bt
> #0  0x0000003839e7b2f0 in strcasecmp () from /lib64/libc.so.6
> #1  0x00002b57db35dbf7 in crm_str_eq (a=0x2b57db3755fb "__name__",
>    b=0xb96ac02022296c6c <Address 0xb96ac02022296c6c out of bounds>, use_case=0) at utils.c:1848
> #2  0x00002b57db3630e4 in log_data_element (function=0x43444f "do_lrm_query",
>    prefix=0x4344a1 "Current state of the LRM", log_level=9, depth=0, data=0x6b95f00, formatted=1)
>    at xml.c:1175
> #3  0x00002b57db361a44 in print_xml_formatted (log_level=9, function=0x43444f "do_lrm_query",
>    msg=0x6b95f00, text=0x4344a1 "Current state of the LRM") at xml.c:775
> #4  0x000000000041b773 in do_lrm_query ()
> #5  0x0000000000413ebc in do_cl_join_finalize_respond ()
> #6  0x0000000000405ad8 in do_fsa_action ()
> #7  0x0000000000406620 in s_crmd_fsa_actions ()
> #8  0x0000000000405f9c in s_crmd_fsa ()
> #9  0x0000000000411b35 in crm_fsa_trigger ()
> #10 0x00002b57db371f63 in crm_trigger_dispatch (source=0x6b91d70, callback=0x411acf <crm_fsa_trigger>,
>    userdata=0x6b91d70) at mainloop.c:53
> #11 0x000000383be2cdb4 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> #12 0x000000383be2fc0d in ?? () from /lib64/libglib-2.0.so.0
> #13 0x000000383be2ff1a in g_main_loop_run () from /lib64/libglib-2.0.so.0
> #14 0x0000000000405020 in crmd_init ()
> #15 0x0000000000404efb in main ()
> ----
>
>
> I ran crm_verify -VVVVV -L
> The output has
> 1. For each line of cib.xml there is a message: "debug: debug2: log_data_element: get_xpath_object: Bad Input".
> 2. many find_xml_node such as
>   find_xml_node: Could not find operations in clone.
>   find_xml_node: Could not find group in clone.
>   find_xml_node: Could not find operations in primitive. (but each primitive has a monitor operation)
>
> 3. Warnings found during check: config may not be valid
>
> My questions:
>
> 1. Can I ignore 1 and 2 since cib.xml passed the xmllint validation?
> 2. Which tool can I use to make sure the cib.xml is absolutely correct?
>
> Thanks,
>
> Ling Li
>
>
>



