[Pacemaker] Why does a node on which no failure has occurred get marked as "lost"?
Andrew Beekhof
andrew at beekhof.net
Thu Feb 20 08:28:26 UTC 2014
On 20 Feb 2014, at 6:06 pm, yusuke iida <yusk.iida at gmail.com> wrote:
> Hi, Andrew
>
> I tested in the following environments.
>
> KVM virtual 16 machines
> CPU: 1
> memory: 2048MB
> OS: RHEL6.4
> Pacemaker-1.1.11(709b36b)
> corosync-2.3.2
> libqb-0.16.0
>
> It looks like performance is much better on the whole.
>
> However, during the 16-node test, the IPC event queue overflowed on
> some of the nodes.
> It happened on vm01 and vm09.
>
> On vm01, the queue overflow occurred between cib and crm_mon.
> Feb 20 14:21:02 [16211] vm01 cib: ( ipc.c:506 ) trace:
> crm_ipcs_flush_events: Sent 40 events (729 remaining) for
> 0x1cd1850[16243]: Resource temporarily unavailable (-11)
> Feb 20 14:21:02 [16211] vm01 cib: ( ipc.c:515 )
> error: crm_ipcs_flush_events: Evicting slow client 0x1cd1850[16243]:
> event queue reached 729 entries
Who was pid 16243?
Doesn't look like a pacemaker daemon.
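
If the process is still running on vm01, something like this would identify it (a sketch; 16243 is simply the pid taken from the log above):

# ps -p 16243 -o pid,ppid,comm,args

If it has already exited, the pid would have to be matched against whatever process information was captured around that time.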
>
> On vm09, the queue overflow occurred between cib and stonithd.
> Feb 20 14:20:22 [15519] vm09 cib: ( ipc.c:506 )
> trace: crm_ipcs_flush_events: Sent 36 events (530 remaining) for
> 0x105ec10[15520]: Resource temporarily unavailable (-11)
> Feb 20 14:20:22 [15519] vm09 cib: ( ipc.c:515 )
> error: crm_ipcs_flush_events: Evicting slow client 0x105ec10[15520]:
> event queue reached 530 entries
>
> I looked at the code for the problem area, but I could not work out
> how it should be solved.
>
> Is sending only 100 messages at a time too few?
> Is there a problem with how the wait time after sending a message is calculated?
> Is the eviction threshold of 500 perhaps too low?
Being 500 behind is really quite a long way.
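
To quantify how quickly a client falls behind, the flush/eviction messages can be pulled out of the node's logs (a sketch; the log path is an assumption and may differ on these machines):

# grep 'crm_ipcs_flush_events' /var/log/cluster/corosync.log | tail -n 40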
>
> I have attached the crm_report from when the problem occurred:
> https://drive.google.com/file/d/0BwMFJItoO-fVeGZuWkFnZTFWTDQ/edit?usp=sharing
>
> Regards,
> Yusuke
> 2014-02-18 19:53 GMT+09:00 yusuke iida <yusk.iida at gmail.com>:
>> Hi, Andrew and Digimer
>>
>> Thank you for the comment.
>>
>> I solved this problem by referring to another mailing-list thread about it:
>> https://bugzilla.redhat.com/show_bug.cgi?id=880035
>>
>> In short, the kernel in my environment was simply too old.
>> I have now updated it to the latest kernel:
>> kernel-2.6.32-431.5.1.el6.x86_64.rpm
>>
>> I have now set the following parameters on the bridge that carries the
>> corosync traffic.
>> As a result, "Retransmit List" messages hardly occur any more.
>> # echo 1 > /sys/class/net/<bridge>/bridge/multicast_querier
>> # echo 0 > /sys/class/net/<bridge>/bridge/multicast_snooping
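
The current values can be verified the same way (a sketch; <bridge> stands for the actual bridge device, as above). Note that these sysfs settings do not persist across a reboot, so they need to be reapplied at boot time:

# cat /sys/class/net/<bridge>/bridge/multicast_querier
# cat /sys/class/net/<bridge>/bridge/multicast_snooping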
>>
>> 2014-02-18 9:49 GMT+09:00 Andrew Beekhof <andrew at beekhof.net>:
>>>
>>> On 31 Jan 2014, at 6:20 pm, yusuke iida <yusk.iida at gmail.com> wrote:
>>>
>>>> Hi, all
>>>>
>>>> I measure the performance of Pacemaker in the following combinations.
>>>> Pacemaker-1.1.11.rc1
>>>> libqb-0.16.0
>>>> corosync-2.3.2
>>>>
>>>> All nodes are KVM virtual machines.
>>>>
>>>> After starting 14 nodes, I forcibly stopped the vm01 node.
>>>> "virsh destroy vm01" was used for the stop.
>>>> Then, in addition to the forcibly stopped node, other nodes were also
>>>> separated from the cluster.
>>>>
>>>> corosync then outputs "Retransmit List:" log messages in large quantities.
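
When nodes start dropping out like this, the membership that corosync itself believes in can be checked from a surviving node (a sketch using standard corosync 2.x tools):

# corosync-quorumtool -s
# corosync-cmapctl | grep members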
>>>
>>> Probably best to poke the corosync guys about this.
>>>
>>> However, <= .11 is known to cause significant CPU usage with that many nodes.
>>> I can easily imagine this starving corosync of resources and causing breakage.
>>>
>>> I would _highly_ recommend retesting with the current git master of pacemaker.
>>> I merged the new cib code last week which is faster by _two_ orders of magnitude and uses significantly less CPU.
>>>
>>> I'd be interested to hear your feedback.
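
For reference, a typical way to build and install from git master would be (a sketch; the repository URL and build steps are assumptions about the usual Pacemaker source layout, not instructions from this thread):

# git clone https://github.com/ClusterLabs/pacemaker.git
# cd pacemaker
# ./autogen.sh && ./configure && make && make install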
>> Since I am very interested in this, I would like to test it, even though
>> the "Retransmit List" problem has already been solved.
>> Please wait a little while for the results.
>>
>> Thanks,
>> Yusuke
>>
>>>
>>>>
>>>> Why does a node on which no failure has occurred get marked as "lost"?
>>>>
>>>> Please advise if there is a problem somewhere in my setup.
>>>>
>>>> I have attached the report from when the problem occurred:
>>>> https://drive.google.com/file/d/0BwMFJItoO-fVMkFWWWlQQldsSFU/edit?usp=sharing
>>>>
>>>> Regards,
>>>> Yusuke
>>>> --
>>>> ----------------------------------------
>>>> METRO SYSTEMS CO., LTD
>>>>
>>>> Yusuke Iida
>>>> Mail: yusk.iida at gmail.com
>>>> ----------------------------------------
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
>>>
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>
>>
>>
>>
>> --
>> ----------------------------------------
>> METRO SYSTEMS CO., LTD
>>
>> Yusuke Iida
>> Mail: yusk.iida at gmail.com
>> ----------------------------------------
>
>
>
> --
> ----------------------------------------
> METRO SYSTEMS CO., LTD
>
> Yusuke Iida
> Mail: yusk.iida at gmail.com
> ----------------------------------------
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org