[Pacemaker] [Problem]Cib cannot update an attribute by 16 node constitution.

Mon Aug 2 23:12:23 EDT 2010

Hi Andrew,

I changed the cluster option to batch-limit=3 and retried.
However, a similar timeout still occurs.

Using SystemTap, I measured the processing just before the timeout (120s).
Only the following functions took a long time.
-----
probe start! ---------------------------------
  cib_process_request  [call-count:179][117,540,173,155 nsec]
  cib_process_command  [call:179]      [116,471,047,275 nsec]
cib_process_command  call function ---
  cib_config_changed   [call:179]      [101,169,909,572 nsec]
cib_config_changed   call function ---
  calculate_xml_digest [call:179]      [ 68,820,560,745 nsec]
  create_xml_node      [call:3012263]  [ 19,855,469,976 nsec]※
  xpath_search         [call:179]      [    145,030,232 nsec]
  diff_xml_object      [call:179]      [ 32,677,359,476 nsec]※
calculate_xml_digest call function ---
  sorted_xml           [call:1505799]  [ 52,512,465,838 nsec]※
  copy_xml             [call:179]      [  3,692,232,073 nsec]
  dump_xml             [call:536]      [  6,177,606,232 nsec]
-----
Is there a way to make this processing faster?
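
For reference, the per-call cost implied by these figures can be worked out with a short script. This is only a sketch: the numbers are copied from the probe output above, the helper names (per_call_ms, share_of, calls_per_request) are hypothetical, and it assumes the reported times include the time spent in called functions, as the nesting of the output suggests.

```python
# Figures copied from the SystemTap output above: (call count, total nsec).
PROFILE = {
    "cib_process_request":  (179,       117_540_173_155),
    "cib_process_command":  (179,       116_471_047_275),
    "cib_config_changed":   (179,       101_169_909_572),
    "calculate_xml_digest": (179,        68_820_560_745),
    "create_xml_node":      (3_012_263,  19_855_469_976),
    "xpath_search":         (179,           145_030_232),
    "diff_xml_object":      (179,        32_677_359_476),
    "sorted_xml":           (1_505_799,  52_512_465_838),
    "copy_xml":             (179,         3_692_232_073),
    "dump_xml":             (536,         6_177_606_232),
}


def per_call_ms(name):
    """Average time of one call to `name`, in milliseconds."""
    calls, total_ns = PROFILE[name]
    return total_ns / calls / 1e6


def share_of(name, parent="cib_process_request"):
    """Fraction of the parent's total time spent in `name`
    (assumes the measured times are inclusive of callees)."""
    return PROFILE[name][1] / PROFILE[parent][1]


def calls_per_request(name, parent="cib_process_request"):
    """Average number of calls to `name` per call of the parent."""
    return PROFILE[name][0] / PROFILE[parent][0]


if __name__ == "__main__":
    for name in PROFILE:
        print(f"{name:22s} {per_call_ms(name):12.3f} ms/call "
              f"{share_of(name):7.1%} of request "
              f"{calls_per_request(name):10.1f} calls/request")
```

On these figures, one cib_process_request takes roughly 0.66 s on average, cib_config_changed accounts for about 86% of that time, and sorted_xml is called about 8,400 times per request.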

2010/6/14 <renayama19661014 at ybb.ne.jp>

> Hi Andrew,
>
> Thank you for comment.
>
> > More likely of the underlying messaging infrastructure, but I'll take a look.
> > Perhaps the default cib operation timeouts are too low for larger clusters.
> >
> > >
> > > I attached the log to the following Bugzilla entry.
> > > * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2443
> >
> > Ok, I'll follow up there.
>
> If there is anything we need to do to help solve the problem, please
> let us know.
>
> Best Regards,
> Hideo Yamauchi.
>
> --- Andrew Beekhof <andrew at beekhof.net> wrote:
>
> > On Mon, Jun 14, 2010 at 4:46 AM,  <renayama19661014 at ybb.ne.jp> wrote:
> > > We tested a 16-node configuration (15+1).
> > >
> > > We carried out the following procedure.
> > >
> > > Step 1) Start 16 nodes.
> > > Step 2) Send the cib after a DC node has been elected.
> > >
> > > An error occurs when the pingd attribute is updated after probe processing has finished.
> > >
> > >
> > > ----------------------------------------------------------------------------------------------------------------------------------------
> > > Jun 14 10:58:03 hb0102 pingd: [2465]: info: ping_read: Retrying...
> > > Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 337 for default_ping_set=1600 failed: Remote node did not respond
> > > Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 340 for default_ping_set=1600 failed: Remote node did not respond
> > > Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 343 for default_ping_set=1600 failed: Remote node did not respond
> > > Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 346 for default_ping_set=1600 failed: Remote node did not respond
> > > Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 349 for default_ping_set=1600 failed: Remote node did not respond
> > > ----------------------------------------------------------------------------------------------------------------------------------------
> > >
> > > While this error was occurring, I ran the cibadmin command (-Q option), but it timed out.
> > > In addition, according to the top command, the cib process on the DC node appeared to be very busy.
> > >
> > >
> > > In addition, a communication error with cib occurs on the DC node, and crmd restarts.
> > >
> > >
> > > ----------------------------------------------------------------------------------------------------------------------------------------
> > > Jun 14 10:58:09 hb0101 attrd: [2278]: WARN: xmlfromIPC: No message received in the required interval (120s)
> > > Jun 14 10:58:09 hb0101 attrd: [2278]: info: attrd_perform_update: Sent update -41: default_ping_set=1600
> > > (snip)
> > > Jun 14 10:59:07 hb0101 crmd: [2280]: info: do_exit: [crmd] stopped (2)
> > > Jun 14 10:59:07 hb0101 corosync[2269]:   [pcmk  ] plugin.c:858 info: pcmk_ipc_exit: Client crmd (conn=0x106a2bf0, async-conn=0x106a2bf0) left
> > > Jun 14 10:59:08 hb0101 corosync[2269]:   [pcmk  ] plugin.c:481 ERROR: pcmk_wait_dispatch: Child process crmd exited (pid=2280, rc=2)
> > > Jun 14 10:59:08 hb0101 corosync[2269]:   [pcmk  ] plugin.c:498 notice: pcmk_wait_dispatch: Respawning failed child process: crmd
> > > Jun 14 10:59:08 hb0101 corosync[2269]:   [pcmk  ] utils.c:131 info: spawn_child: Forked child 2680 for process crmd
> > > Jun 14 10:59:08 hb0101 crmd: [2680]: info: Invoked: /usr/lib64/heartbeat/crmd
> > > Jun 14 10:59:08 hb0101 crmd: [2680]: info: main: CRM Hg Version: 9f04fa88cfd3da553e977cc79983d1c494c8b502
> > > Jun 14 10:59:08 hb0101 crmd: [2680]: info: crmd_init: Starting crmd
> > > Jun 14 10:59:08 hb0101 crmd: [2680]: info: G_main_add_SignalHandler: Added signal handler for signal 17
> > > ----------------------------------------------------------------------------------------------------------------------------------------
> > >
> > > There seems to be some problem with cib on the DC node.
> > > We would like attribute changes to complete reliably on 16 nodes.
> > > * Is this behavior a limitation of the current cib process?
> >
> > More likely of the underlying messaging infrastructure, but I'll take a look.
> > Perhaps the default cib operation timeouts are too low for larger clusters.
> >
> > >
> > > I attached the log to the following Bugzilla entry.
> > > * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2443
> >
> > Ok, I'll follow up there.
> >
> > >
> > > Best Regards,
> > > Hideo Yamauchi.
> > >
> > >
> > >
> > > _______________________________________________
> > > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > >
> > > Project Home: http://www.clusterlabs.org
> > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
> > >