[ClusterLabs] Re: corosync eats the whole CPU core in epoll_wait() on one node in cluster

Vladislav Bogdanov bubble at hoster-ok.com
Tue Jun 2 09:24:54 UTC 2015


Hi Honza,

02.06.2015 11:03, Jan Friesse wrote:
> Vladislav Bogdanov wrote:
>> 02.06.2015 09:17, Ulrich Windl wrote:
>>>>>> Jan Friesse <jfriesse at redhat.com> wrote on 01.06.2015 at 16:20 in
>>>>>> message
>>> <556C6A19.5080007 at redhat.com>:
>>>
>>> [...you cut the part where it looks like an invalid file
>>> descriptor is being polled...]
>>>
>>>> strace of pacemakerd looks absolutely normal.
>>>
>>> [...]
>>>
>>> If you wait for I/O on an invalid file descriptor, you can keep the
>>> CPU busy quite easily. Usually this is a case where checking errno to
>>> quit the loop helps ;-)
>>>
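
(To make Ulrich's point concrete, here is a minimal standalone sketch,
not corosync code: a poll loop that ignores POLLNVAL on a closed
descriptor returns immediately on every iteration and pegs the CPU;
checking revents/errno is what breaks out of it.)

#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];

    if (pipe(fds) != 0) {
        return 1;
    }
    close(fds[0]);                      /* this fd is now invalid */

    struct pollfd pfd = { .fd = fds[0], .events = POLLIN };
    for (;;) {
        int rc = poll(&pfd, 1, -1);     /* returns immediately */
        if (rc < 0) {
            perror("poll");             /* e.g. EBADF, EINTR */
            break;
        }
        if (pfd.revents & POLLNVAL) {   /* invalid descriptor */
            fprintf(stderr, "fd %d is invalid, dropping it\n", pfd.fd);
            break;                      /* without this: 100% CPU */
        }
    }
    return 0;
}
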
>>
>> Yep, an invalid/disabled fd could be the root of the issue, but I'd like
>> to make sure whether I hit #147 (pe->state == QB_POLL_ENTRY_DELETED) or
>> it is a completely different issue.
>>
>> There is no reproducer code/path available in #147, so I'm unable to
>> compare strace outputs with it.
> 
> AFAIK Dave was talking about creating too many connections (not in
> parallel, just open/close). So you can try something simple like "while
> true; do corosync-cmapctl; done". In theory, the bug should reproduce (not
> on the cpg socket but on the cmap socket).
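
For completeness, the same hammering can be done from C without the
fork/exec overhead of corosync-cmapctl. A minimal sketch, assuming the
cmap client API (cmap_initialize()/cmap_finalize(), linked with -lcmap):

#include <corosync/corotypes.h>
#include <corosync/cmap.h>
#include <stdio.h>

int main(void)
{
    /* Open and immediately close cmap IPC connections in a tight
     * loop, mimicking "while true; do corosync-cmapctl; done". */
    for (;;) {
        cmap_handle_t handle;
        cs_error_t err = cmap_initialize(&handle);

        if (err != CS_OK) {
            fprintf(stderr, "cmap_initialize failed: %d\n", (int)err);
            continue;
        }
        cmap_finalize(handle);
    }
    return 0;
}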

I took another round, giving oprofile a chance.
Here are the top lines of the output of 'opreport -lg':

samples  %        linenr info                 image name               symbol name
106415   32.2366  (no location information)   no-vmlinux               /no-vmlinux
55839    16.9155  loop_poll.c:142             libqb.so.0.17.1          qb_poll_fds_usage_check_
52571    15.9255  array.c:103                 libqb.so.0.17.1          qb_array_index
17756     5.3789  (no location information)   [vdso] (tgid:2177 range:0x7fffae5ee000-0x7fffae5eefff) [vdso] (tgid:2177 range:0x7fffae5ee000-0x7fffae5eefff)
9690      2.9354  loop.c:140                  libqb.so.0.17.1          qb_loop_run
8954      2.7125  (no location information)   libc-2.12.so             __epoll_wait_nocancel
7712      2.3362  loop_timerlist.c:84         libqb.so.0.17.1          qb_loop_timer_msec_duration_to_expire
7430      2.2508  ringbuffer.c:624            libqb.so.0.17.1          qb_rb_chunk_peek
7300      2.2114  loop_poll_epoll.c:144       libqb.so.0.17.1          _poll_and_add_to_jobs_
6928      2.0987  (no location information)   librt-2.12.so            clock_gettime
5849      1.7719  ipcs.c:744                  libqb.so.0.17.1          qb_ipcs_dispatch_connection_request
5630      1.7055  (no location information)   libc-2.12.so             __libc_disable_asynccancel
...

I cannot dig too deep into the libqb code right now, but at first glance
everything seems consistent with what #147 fixes.
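
For reference, my loose reading of the pattern #147 guards against, as
paraphrased pseudo-C rather than the actual libqb source
(entry_for_event() and dispatch() are made up for the sketch):

/* Events already harvested for a poll entry that has since been
 * marked deleted must be skipped; otherwise the dispatch loop keeps
 * finding "ready" work and calls epoll_wait() with a zero timeout
 * forever instead of blocking. */
for (int i = 0; i < ready; i++) {
    struct qb_poll_entry *pe = entry_for_event(&events[i]); /* hypothetical */

    if (pe->state == QB_POLL_ENTRY_DELETED) {
        continue;       /* stale entry: do not dispatch or re-arm */
    }
    dispatch(pe);       /* hypothetical */
}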

Am I missing something?

Best,
Vladislav

> 
> Honza
> 
>>
>>> Not saying I diagnosed the problem correctly, but that was my first
>>> impression.
>>>
>>> Regards,
>>> Ulrich
>>>