[ClusterLabs] corosync eats the whole CPU core in epoll_wait() on one node in cluster
Andrew Beekhof
andrew at beekhof.net
Mon Jun 1 09:00:58 UTC 2015
> On 1 Jun 2015, at 5:21 pm, Jan Friesse <jfriesse at redhat.com> wrote:
>
> Vladislav
>
> Vladislav Bogdanov wrote:
>> Hi,
>>
>> I just noticed the problem from the subject line on only one node of a 4-node cluster.
>>
>> I've dumped the blackbox logs, but unfortunately that didn't help me
>> understand what's going on, because even the debug logs are too sparse.
>
> Do you still have them? A few lines from them might be helpful.
>
>
>>
>> strace on a running process doesn't show anything except epoll_wait.
>> ...
>> epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107) = 1
>> epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107) = 1
>> epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107) = 1
>> epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107) = 1
>> ...
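For readers unfamiliar with the pattern in the strace output above: epoll is level-triggered by default, so epoll_wait() keeps returning EPOLLIN immediately for as long as unread data sits on a registered descriptor. The following is only a minimal sketch of that general behaviour on a throwaway socket pair, not a diagnosis of what corosync or libqb is doing here. The function name count_wakeups and its parameters are made up for illustration.

```python
import select
import socket
import time


def count_wakeups(duration_s=0.2, consume=False):
    """Count how often epoll reports EPOLLIN on a socket within duration_s.

    With consume=False the pending byte is never read, so a level-triggered
    epoll reports the descriptor as readable on every single call.
    """
    reader, writer = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    ep = select.epoll()
    try:
        # Plain EPOLLIN registration is level-triggered (no EPOLLET flag).
        ep.register(reader.fileno(), select.EPOLLIN)
        writer.send(b"x")  # a single byte of pending data

        wakeups = 0
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            # 0.107 s mirrors the 107 ms timeout seen in the strace output.
            events = ep.poll(0.107)
            if events:
                wakeups += 1
                if consume:
                    reader.recv(1)  # drain the data so epoll can block again
        return wakeups
    finally:
        ep.close()
        reader.close()
        writer.close()


if __name__ == "__main__":
    print("data never read:", count_wakeups(consume=False), "wakeups")
    print("data read once: ", count_wakeups(consume=True), "wakeups")
```

When the byte is drained, the loop wakes once and then sleeps through its timeouts. When it is left unread, the loop spins without ever blocking, which is the shape of the trace above.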
>>
>> But those calls are far too frequent:
>> # timeout 10 strace -p 2177 2>&1 | grep EPOLLIN >/tmp/corosync-epoll.log
>> Terminated
>> # wc -l /tmp/corosync-epoll.log
>> 438399 /tmp/corosync-epoll.log
>>
>> That means ~43840 calls per second.
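To spell out the arithmetic behind that figure: 438399 matching lines over a 10 second capture is about 43840 calls per second. For comparison, a loop that really slept for the full 107 ms timeout on every call could manage only about 9 calls per second. The sketch below reproduces both numbers. The helper name calls_per_second is hypothetical, and treating 107 ms as the timeout of every call is an assumption, since only a few calls are shown.

```python
def calls_per_second(strace_lines, duration_s, marker="EPOLLIN"):
    """Count the lines containing marker and divide by the capture duration."""
    hits = sum(1 for line in strace_lines if marker in line)
    return hits / duration_s


if __name__ == "__main__":
    # Figures taken from the report: 438399 matching lines in a 10 s capture.
    observed = 438399 / 10
    # Ceiling if every call slept for its full 107 ms timeout (an assumption).
    idle_ceiling = 1 / 0.107
    print(f"observed:     {observed:.0f} calls/s")
    print(f"idle ceiling: {idle_ceiling:.1f} calls/s")
```

The same helper can be pointed at the captured file, for example calls_per_second(open("/tmp/corosync-epoll.log"), 10).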
>>
>> The other nodes show none.
>> The Pacemaker DC is on another node.
>>
>> Nodes are completely identical.
>>
>> fd 19, which generates those events, is shown by lsof like this:
>> corosync 2177 root 19u unix 0xffff88062f896680 0t0 17987 socket
>>
>> netstat for that inode (17987) shows:
>> unix 3 [ ] STREAM CONNECTED 17987 2177/corosync @cpg
>>
>> So that socket is used by CPG.
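The lsof and netstat steps above amount to two lookups on Linux: /proc/<pid>/fd/<fd> is a symlink of the form socket:[inode], and /proc/net/unix lists each UNIX socket's inode with its bound path, where abstract names are shown with a leading @. The following is a rough sketch of the same lookups, assuming a Linux /proc and enough privileges to read another process's fd directory. The function names are made up for illustration.

```python
import os
import re


def socket_inode(pid, fd):
    """Return the socket inode behind /proc/<pid>/fd/<fd>, or None if not a socket."""
    target = os.readlink(f"/proc/{pid}/fd/{fd}")  # e.g. "socket:[17987]"
    match = re.fullmatch(r"socket:\[(\d+)\]", target)
    return int(match.group(1)) if match else None


def unix_socket_path(inode, table_text):
    """Find an inode in text formatted like /proc/net/unix.

    Returns the bound path ("" for an unnamed socket) or None if the inode
    is not listed. The columns are:
    Num RefCount Protocol Flags Type St Inode [Path]
    """
    for line in table_text.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 7)
        if len(fields) >= 7 and fields[6] == str(inode):
            return fields[7].strip() if len(fields) > 7 else ""
    return None


if __name__ == "__main__":
    import socket

    # Demonstrate on our own process so the sketch runs without root.
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    inode = socket_inode(os.getpid(), a.fileno())
    with open("/proc/net/unix") as table:
        path = unix_socket_path(inode, table.read())
    print(f"fd {a.fileno()} -> inode {inode}, path {path!r}")
    a.close()
    b.close()
```

For the case in this thread the call would be socket_inode(2177, 19), run as root on the affected node.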
>>
>> The adjacent socket inode (the connecting end, 17986) is used by pacemakerd.
>>
>> strace of pacemakerd shows completely normal behaviour:
>> poll([{fd=8, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5,
>> events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}], 4, 500) = 0 (Timeout)
>> poll([{fd=8, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5,
>> events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}], 4, 500) = 0 (Timeout)
>> poll([{fd=8, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5,
>> events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}], 4, 500) = 0 (Timeout)
>>
>
> AFAIK pacemaker doesn't use the libqb qb_loop, but corosync does. That's the reason.
The reason for what?
>
>> So, this looks like a defect, but where?
>> libqb seems to be the main suspect, but I'm not sure.
>>
>> This is CentOS 6, corosync 53f67a2 on top of libqb 0.17.1 (a recompile of
>> David's 0.17.1-1 dated Tue Aug 26 2014).
>> Pacemaker is fbc239b.
>
> Are you able to reproduce the bug after a corosync/node restart? If so, could you try libqb master? There was a bug in the qb_loop part of libqb (https://github.com/ClusterLabs/libqb/pull/147), so it may be related.
>
> Regards,
> Honza
>
>
>>
>> Best,
>> Vladislav
>>
>>
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> http://clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org