[ClusterLabs] Antw: Re: corosync eats the whole CPU core in epoll_wait() on one node in cluster

Tue Jun 2 08:03:50 UTC 2015

Vladislav Bogdanov wrote:
> 02.06.2015 09:17, Ulrich Windl wrote:
>>>>> Jan Friesse <jfriesse at redhat.com> wrote on 01.06.2015 at 16:20 in
>>>>> message
>> <556C6A19.5080007 at redhat.com>:
>>
>> [...you cut the part that looks like polling an invalid file
>> descriptor...]
>>
>>> strace of pacemakerd looks absolutely normal.
>>
>> [...]
>>
>> If you wait for I/O on an invalid file descriptor, you can keep the
>> CPU busy quite easily. Usually this is the case where checking errno
>> to quit the loop helps ;-)
>>
> 
> Yep, an invalid/disabled fd could be the root of the issue, but I'd like
> to find out whether I hit #147 (pe->state == QB_POLL_ENTRY_DELETED) or
> whether it is a completely different issue.
> 
> There is no reproducer code/path available in #147, so I'm unable to
> compare strace outputs with it.

AFAIK Dave was talking about creating too many connections (not in
parallel, just open/close). So you can try something simple like "while
true; do corosync-cmapctl; done". In theory, the bug should reproduce
(on the cmap socket rather than on cpg).
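
The open/close loop suggested above can be bounded so that it stops on its
own and reports how many runs failed. This is only a minimal sketch for a
POSIX shell: the function name "hammer" and its two parameters are made up
for illustration and are not part of corosync.

```sh
#!/bin/sh
# Hypothetical helper (not part of corosync): run a client command
# repeatedly, one invocation at a time, so connections are opened and
# closed sequentially rather than in parallel.
#   $1  command to run        (assumed default: corosync-cmapctl)
#   $2  number of iterations  (assumed default: 1000)
hammer() {
    cmd=${1:-corosync-cmapctl}
    count=${2:-1000}
    i=0
    failures=0
    while [ "$i" -lt "$count" ]; do
        # Each run connects to the daemon, dumps its output and disconnects.
        "$cmd" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "runs=$i failures=$failures"
}
```

For example, run "hammer corosync-cmapctl 10000" on the affected node and
watch the corosync process in top while it runs. Whether that many
iterations are enough to trigger the problem is a guess.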

Honza

> 
>> Not saying I diagnosed the problem correctly, but that was my first
>> impression.
>>
>> Regards,
>> Ulrich
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> http://clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>