[ClusterLabs] Antw: Re: corosync eats the whole CPU core in epoll_wait() on one node in cluster
Vladislav Bogdanov
bubble at hoster-ok.com
Tue Jun 2 09:24:54 UTC 2015
Hi Honza,
02.06.2015 11:03, Jan Friesse wrote:
> Vladislav Bogdanov wrote:
>> 02.06.2015 09:17, Ulrich Windl wrote:
>>>>>> Jan Friesse <jfriesse at redhat.com> wrote on 01.06.2015 at 16:20 in
>>>>>> message <556C6A19.5080007 at redhat.com>:
>>>
>>> [...you cut the part where it seems like polling an invalid file
>>> descriptor...]
>>>
>>>> strace of pacemakerd shows absolutely nothing abnormal.
>>>
>>> [...]
>>>
>>> If you wait for I/O on an invalid file descriptor, you can busy the
>>> CPU quite easily. Usually this is a case where checking errno to
>>> quit the loop helps ;-)
>>>
>>
>> Yep, an invalid/disabled fd could be the root of the issue, but I'd
>> like to make sure whether I hit #147 (pe->state == QB_POLL_ENTRY_DELETED)
>> or it is a completely different issue.
>>
>> There is no reproducer code/path available in #147, so I'm unable to
>> compare strace outputs with it.
>
> AFAIK Dave was talking about creating too many connections (not in
> parallel, just open/close). So you can try something simple like
> "while true; do corosync-cmapctl; done". In theory, the bug should
> reproduce (not on the cpg socket but on the cmap socket).
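For anyone without a test cluster handy, the open/close churn Honza describes can be sketched against a plain Unix socket. This is only an illustration of the access pattern, nothing corosync-specific; the socket path and cycle count are made up:

```python
import os
import socket
import tempfile
import threading

# Hypothetical stand-in for the corosync cmap IPC endpoint: a plain
# Unix stream socket. The client loop below mimics the churn of
# "while true; do corosync-cmapctl; done" -- many short-lived
# connections, never more than one open at a time.
path = os.path.join(tempfile.mkdtemp(), "ipc.sock")
server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(path)
server.listen(128)

CYCLES = 500

def acceptor():
    # Accept each connection and drop it immediately, the way a
    # service answers a one-shot query.
    for _ in range(CYCLES):
        conn, _ = server.accept()
        conn.close()

t = threading.Thread(target=acceptor)
t.start()
for _ in range(CYCLES):
    c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    c.connect(path)
    c.close()
t.join()
server.close()
print(f"completed {CYCLES} connect/close cycles")
```

Each cycle forces the server side to add and then remove a poll entry, which is exactly the path where a stale QB_POLL_ENTRY_DELETED entry could be left behind.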
I took another round, this time giving oprofile a chance.
Here are the top lines of 'opreport -lg' output:
samples  %        linenr info                image name       symbol name
106415   32.2366  (no location information)  no-vmlinux       /no-vmlinux
55839    16.9155  loop_poll.c:142            libqb.so.0.17.1  qb_poll_fds_usage_check_
52571    15.9255  array.c:103                libqb.so.0.17.1  qb_array_index
17756     5.3789  (no location information)  [vdso] (tgid:2177 range:0x7fffae5ee000-0x7fffae5eefff)  [vdso] (tgid:2177 range:0x7fffae5ee000-0x7fffae5eefff)
9690      2.9354  loop.c:140                 libqb.so.0.17.1  qb_loop_run
8954      2.7125  (no location information)  libc-2.12.so     __epoll_wait_nocancel
7712      2.3362  loop_timerlist.c:84        libqb.so.0.17.1  qb_loop_timer_msec_duration_to_expire
7430      2.2508  ringbuffer.c:624           libqb.so.0.17.1  qb_rb_chunk_peek
7300      2.2114  loop_poll_epoll.c:144      libqb.so.0.17.1  _poll_and_add_to_jobs_
6928      2.0987  (no location information)  librt-2.12.so    clock_gettime
5849      1.7719  ipcs.c:744                 libqb.so.0.17.1  qb_ipcs_dispatch_connection_request
5630      1.7055  (no location information)  libc-2.12.so     __libc_disable_asynccancel
...
I cannot get too deep into the libqb code right now, but at first glance
everything seems consistent with what #147 fixes.
Am I missing something?
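For anyone following along, the busy spin Ulrich describes is easy to demonstrate in miniature. This uses plain poll() rather than epoll and nothing libqb-specific, purely to show the mechanism of a stale entry left registered in a poll set:

```python
import os
import select
import time

# A pipe's read end is registered for polling, then both ends are
# closed without unregistering it -- leaving a stale entry behind,
# analogous to a QB_POLL_ENTRY_DELETED slot still in the epoll set.
r, w = os.pipe()
p = select.poll()
p.register(r, select.POLLIN)
os.close(r)
os.close(w)

spins = 0
deadline = time.monotonic() + 0.05
while time.monotonic() < deadline:
    events = p.poll(100)  # 100 ms timeout, yet it returns instantly
    # The dead fd is reported as POLLNVAL on every single call.
    assert events and events[0][1] & select.POLLNVAL
    spins += 1  # a real event loop would spin here at 100% CPU
print(f"poll() woke {spins} times in 50 ms instead of sleeping")
```

A loop that never removes the dead entry wakes up thousands of times per second with zero useful work, which matches the symptom of epoll_wait() eating a whole core.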
Best,
Vladislav
>
> Honza
>
>>
>>> Not saying I diagnosed the problem correctly, but that was my first
>>> impression.
>>>
>>> Regards,
>>> Ulrich
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Users mailing list: Users at clusterlabs.org
>>> http://clusterlabs.org/mailman/listinfo/users
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>
>>
>>
>
>
>