[Pacemaker] Why does a node on which no failure has occurred get marked as "lost"?
Andrew Beekhof
andrew at beekhof.net
Thu Feb 20 08:28:26 UTC 2014
On 20 Feb 2014, at 6:06 pm, yusuke iida <yusk.iida at gmail.com> wrote:
> Hi, Andrew
>
> I tested in the following environments.
>
> KVM virtual 16 machines
> CPU: 1
> memory: 2048MB
> OS: RHEL6.4
> Pacemaker-1.1.11(709b36b)
> corosync-2.3.2
> libqb-0.16.0
>
> It looks like performance is much better on the whole.
>
> However, during the 16-node test, the IPC event queue overflowed on
> some of the nodes.
> It happened on vm01 and vm09.
>
> On vm01, the queue overflow occurred between cib and crm_mon.
> Feb 20 14:21:02 [16211] vm01 cib: ( ipc.c:506 ) trace:
> crm_ipcs_flush_events: Sent 40 events (729 remaining) for
> 0x1cd1850[16243]: Resource temporarily unavailable (-11)
> Feb 20 14:21:02 [16211] vm01 cib: ( ipc.c:515 )
> error: crm_ipcs_flush_events: Evicting slow client 0x1cd1850[16243]:
> event queue reached 729 entries
Who was pid 16243?
Doesn't look like a pacemaker daemon.
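
If the process is still running on vm01, something like this would identify it (a sketch; 16243 is simply the pid taken from the log above):

# ps -p 16243 -o pid,ppid,comm,args

If it has already exited, the pid would have to be matched against whatever process information was captured around that time.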
>
> On vm09, the queue overflow occurred between cib and stonithd.
> Feb 20 14:20:22 [15519] vm09 cib: ( ipc.c:506 )
> trace: crm_ipcs_flush_events: Sent 36 events (530 remaining) for
> 0x105ec10[15520]: Resource temporarily unavailable (-11)
> Feb 20 14:20:22 [15519] vm09 cib: ( ipc.c:515 )
> error: crm_ipcs_flush_events: Evicting slow client 0x105ec10[15520]:
> event queue reached 530 entries
>
> I looked at the code for the problem area, but I could not work out
> how it should be solved.
>
> Is sending only 100 messages at a time too few?
> Is there a problem with how the wait time after sending a message is calculated?
> Is the eviction threshold of 500 perhaps too low?
Being 500 behind is really quite a long way.
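
To quantify how quickly a client falls behind, the flush/eviction messages can be pulled out of the node's logs (a sketch; the log path is an assumption and may differ on these machines):

# grep 'crm_ipcs_flush_events' /var/log/cluster/corosync.log | tail -n 40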
>
> I have attached the crm_report from when the problem occurred:
> https://drive.google.com/file/d/0BwMFJItoO-fVeGZuWkFnZTFWTDQ/edit?usp=sharing
>
> Regards,
> Yusuke
> 2014-02-18 19:53 GMT+09:00 yusuke iida <yusk.iida at gmail.com>:
>> Hi, Andrew and Digimer
>>
>> Thank you for the comment.
>>
>> I solved this problem by referring to another mailing-list thread about it:
>> https://bugzilla.redhat.com/show_bug.cgi?id=880035
>>
>> In short, the kernel in my environment was simply too old.
>> I have now updated it to the latest kernel:
>> kernel-2.6.32-431.5.1.el6.x86_64.rpm
>>
>> I have now set the following parameters on the bridge that carries the
>> corosync traffic.
>> As a result, "Retransmit List" messages hardly occur any more.
>> # echo 1 > /sys/class/net/<bridge>/bridge/multicast_querier
>> # echo 0 > /sys/class/net/<bridge>/bridge/multicast_snooping
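
The current values can be verified the same way (a sketch; <bridge> stands for the actual bridge device, as above). Note that these sysfs settings do not persist across a reboot, so they need to be reapplied at boot time:

# cat /sys/class/net/<bridge>/bridge/multicast_querier
# cat /sys/class/net/<bridge>/bridge/multicast_snooping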
>>
>> 2014-02-18 9:49 GMT+09:00 Andrew Beekhof <andrew at beekhof.net>:
>>>
>>> On 31 Jan 2014, at 6:20 pm, yusuke iida <yusk.iida at gmail.com> wrote:
>>>
>>>> Hi, all
>>>>
>>>> I measure the performance of Pacemaker in the following combinations.
>>>> Pacemaker-1.1.11.rc1
>>>> libqb-0.16.0
>>>> corosync-2.3.2
>>>>
>>>> All nodes are KVM virtual machines.
>>>>
>>>> After starting 14 nodes, I forcibly stopped the vm01 node.
>>>> "virsh destroy vm01" was used for the stop.
>>>> Then, in addition to the forcibly stopped node, other nodes were also
>>>> separated from the cluster.
>>>>
>>>> corosync then outputs "Retransmit List:" log messages in large quantities.
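
When nodes start dropping out like this, the membership that corosync itself believes in can be checked from a surviving node (a sketch using standard corosync 2.x tools):

# corosync-quorumtool -s
# corosync-cmapctl | grep members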
>>>
>>> Probably best to poke the corosync guys about this.
>>>
>>> However, <= .11 is known to cause significant CPU usage with that many nodes.
>>> I can easily imagine this starving corosync of resources and causing breakage.
>>>
>>> I would _highly_ recommend retesting with the current git master of pacemaker.
>>> I merged the new cib code last week which is faster by _two_ orders of magnitude and uses significantly less CPU.
>>>
>>> I'd be interested to hear your feedback.
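
For reference, a typical way to build and install from git master would be (a sketch; the repository URL and build steps are assumptions about the usual Pacemaker source layout, not instructions from this thread):

# git clone https://github.com/ClusterLabs/pacemaker.git
# cd pacemaker
# ./autogen.sh && ./configure && make && make install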
>> Since I am very interested in this, I would like to test it, even though
>> the "Retransmit List" problem has already been solved.
>> Please wait a little while for the results.
>>
>> Thanks,
>> Yusuke
>>
>>>
>>>>
>>>> Why does a node on which no failure has occurred get marked as "lost"?
>>>>
>>>> Please advise if there is a problem somewhere in my setup.
>>>>
>>>> I have attached the report from when the problem occurred:
>>>> https://drive.google.com/file/d/0BwMFJItoO-fVMkFWWWlQQldsSFU/edit?usp=sharing
>>>>
>>>> Regards,
>>>> Yusuke
>>>> --
>>>> ----------------------------------------
>>>> METRO SYSTEMS CO., LTD
>>>>
>>>> Yusuke Iida
>>>> Mail: yusk.iida at gmail.com
>>>> ----------------------------------------
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
>>>
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>
>>
>>
>>
>> --
>> ----------------------------------------
>> METRO SYSTEMS CO., LTD
>>
>> Yusuke Iida
>> Mail: yusk.iida at gmail.com
>> ----------------------------------------
>
>
>
> --
> ----------------------------------------
> METRO SYSTEMS CO., LTD
>
> Yusuke Iida
> Mail: yusk.iida at gmail.com
> ----------------------------------------
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org