[ClusterLabs] corosync eats the whole CPU core in epoll_wait() on one node in cluster

Jan Friesse jfriesse at redhat.com
Mon Jun 1 14:20:09 UTC 2015


Andrew Beekhof wrote:
>
>> On 1 Jun 2015, at 5:21 pm, Jan Friesse <jfriesse at redhat.com> wrote:
>>
>> Vladislav
>>
>> Vladislav Bogdanov wrote:
>>> Hi,
>>>
>>> Just noticed the subject issue on one node of a 4-node cluster.
>>>
>>> I've dumped the blackbox logs, but unfortunately they didn't help me
>>> understand what's going on; even the debug logs are too sparse.
>>
>> Do you still have them? A few lines from them might be helpful.
>>
>>
>>>
>>> strace on the running process doesn't show anything except epoll_wait:
>>> ...
>>> epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107) = 1
>>> epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107) = 1
>>> epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107) = 1
>>> epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107) = 1
>>> ...
>>>
>>> But those events are far too frequent:
>>> # timeout 10 strace -p 2177 2>&1 | grep EPOLLIN >/tmp/corosync-epoll.log
>>> Terminated
>>> # wc -l /tmp/corosync-epoll.log
>>> 438399 /tmp/corosync-epoll.log
>>>
>>> That means ~43,840 wakeups per second.
>>>
>>> Other nodes show zero.
>>> The Pacemaker DC is on another node.
>>>
>>> Nodes are completely identical.
>>>
>>> fd 19, which generates those events, is shown by lsof as:
>>> corosync   2177      root   19u     unix 0xffff88062f896680          0t0
>>>       17987 socket
>>>
>>> netstat for that inode (17987) shows:
>>> unix  3      [ ]         STREAM     CONNECTED     17987  2177/corosync
>>>       @cpg
>>>
>>> So that socket is used by CPG.
>>>
>>> The nearest socket inode (the connecting one, 17986) is used by pacemakerd.
>>>
>>> strace of pacemakerd shows a perfectly normal loop:
>>> poll([{fd=8, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5,
>>> events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}], 4, 500) = 0 (Timeout)
>>> poll([{fd=8, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5,
>>> events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}], 4, 500) = 0 (Timeout)
>>> poll([{fd=8, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5,
>>> events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}], 4, 500) = 0 (Timeout)
>>>
>>
>> AFAIK Pacemaker doesn't use the libqb qb_loop, but corosync does. That's the reason.
>
> The reason for?

The reason why the strace of pacemakerd looks perfectly normal.

>
>>
>>> So, this looks like a defect, but where?
>>> libqb seems to be the main suspect, but I'm not sure.
>>>
>>> This is CentOS 6, corosync 53f67a2 on top of libqb 0.17.1 (a recompile
>>> of David's 0.17.1-1, dated Tue Aug 26 2014).
>>> Pacemaker is fbc239b.
>>
>> Are you able to reproduce the bug after a corosync/node restart? If so, could you try libqb master? There was a bug in libqb (https://github.com/ClusterLabs/libqb/pull/147) in the qb_loop part, so it may be related.
>>
>> Regards,
>>   Honza
>>
>>
>>>
>>> Best,
>>> Vladislav
>>>
>>>
>>>
>>> _______________________________________________
>>> Users mailing list: Users at clusterlabs.org
>>> http://clusterlabs.org/mailman/listinfo/users
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>
>>




