[ClusterLabs] corosync eats the whole CPU core in epoll_wait() on one node in cluster

Mon Jun 1 07:21:06 UTC 2015

Vladislav

Vladislav Bogdanov wrote:
> Hi,
>
> Just noticed the problem from the subject line on just one node in a 4-node cluster.
>
> I've dumped blackbox logs, but unfortunately that didn't help me
> understand what's going on, because even the debug logs are too sparse.

Do you still have them? A few lines from them may be helpful.

>
> strace on a running process doesn't show anything except epoll_wait.
> ...
> epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107) = 1
> epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107) = 1
> epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107) = 1
> epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107) = 1
> ...
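
For context, a repeating `epoll_wait(...) = 1` with EPOLLIN on the same fd is
what level-triggered epoll does whenever a descriptor stays readable and
nothing drains it. Below is a minimal Python sketch of that general behavior,
not a reproduction of this corosync problem. The socket pair and the helper
name `count_wakeups` are illustrative assumptions, and it is Linux-only.

```python
import select
import socket


def count_wakeups(drain, rounds=5, timeout=0.107):
    """Count how often a level-triggered epoll reports the same readable fd.

    drain=True reads the pending data after a wakeup;
    drain=False leaves it in the socket buffer, as a stuck handler would.
    """
    reader, writer = socket.socketpair()
    ep = select.epoll()
    try:
        ep.register(reader.fileno(), select.EPOLLIN)  # level-triggered by default
        writer.send(b"x")  # one pending byte makes the fd readable
        wakeups = 0
        for _ in range(rounds):
            events = ep.poll(timeout)  # seconds; 107 ms as in the strace above
            if events:
                wakeups += 1
                if drain:
                    reader.recv(4096)  # consume the data so the fd goes quiet
        return wakeups
    finally:
        ep.close()
        reader.close()
        writer.close()
```

With `drain=False` every call reports the fd again without waiting for the
timeout; with `drain=True` only the first call does.
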
>
> But those are way too frequent:
> # timeout 10 strace -p 2177 2>&1 | grep EPOLLIN >/tmp/corosync-epoll.log
> Terminated
> # wc -l /tmp/corosync-epoll.log
> 438399 /tmp/corosync-epoll.log
>
> that means: ~43840 times per second.
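
The rate above is simply the number of matching strace lines divided by the
capture time (438399 lines / 10 s). A small Python sketch of the same
arithmetic, in case the log needs to be re-checked later; the function name
`calls_per_second` is hypothetical.

```python
def calls_per_second(lines, duration_s, marker="EPOLLIN"):
    """Return how many lines containing `marker` occurred per second."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    matches = sum(1 for line in lines if marker in line)
    return matches / duration_s


# Usage against the log captured above (path taken from the original command):
# with open("/tmp/corosync-epoll.log") as log:
#     print(calls_per_second(log, 10))
```
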
>
> Other nodes show zero.
> Pacemaker DC is on another node.
>
> Nodes are completely identical.
>
> fd 19, which generates those events, is shown in lsof this way:
> corosync   2177      root   19u     unix 0xffff88062f896680          0t0 17987 socket
>
> netstat for that inode (17987) shows:
> unix  3      [ ]         STREAM     CONNECTED     17987  2177/corosync @cpg
>
> So that socket is used by CPG.
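
The same fd -> inode -> socket name lookup can be done without lsof and
netstat by reading /proc directly. A rough Python sketch, assuming the usual
Linux /proc/net/unix column layout (Num RefCount Protocol Flags Type St Inode
Path); the helper names are hypothetical.

```python
import re


def socket_inode(fd_link):
    """Extract the inode from a /proc/<pid>/fd/<n> link target like 'socket:[17987]'."""
    match = re.fullmatch(r"socket:\[(\d+)\]", fd_link)
    return int(match.group(1)) if match else None


def unix_socket_paths(proc_net_unix_text):
    """Map inode -> bound path from the text of /proc/net/unix.

    Unbound sockets have no Path column and map to an empty string.
    """
    paths = {}
    for line in proc_net_unix_text.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 7)  # keep a path with spaces in one field
        if len(fields) >= 7:
            paths[int(fields[6])] = fields[7] if len(fields) > 7 else ""
    return paths


# Usage on the affected node (pid and fd taken from the report above):
# import os
# inode = socket_inode(os.readlink("/proc/2177/fd/19"))
# with open("/proc/net/unix") as f:
#     print(inode, unix_socket_paths(f.read()).get(inode))
```
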
>
> The nearest socket inode (the connecting one, 17986) is used by pacemakerd.
>
> strace of pacemakerd shows absolutely normal behavior:
> poll([{fd=8, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}], 4, 500) = 0 (Timeout)
> poll([{fd=8, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}], 4, 500) = 0 (Timeout)
> poll([{fd=8, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}], 4, 500) = 0 (Timeout)
>

AFAIK pacemaker doesn't use libqb's qb_loop, but corosync does. That's the
reason for the difference.

> So, this looks like a defect, but where?
> libqb seems to be the main suspect, but I'm not sure.
>
> That is centos6, corosync 53f67a2 on top of libqb 0.17.1 (recompile of
> David's 0.17.1-1 dated Tue Aug 26 2014).
> Pacemaker is fbc239b.

Are you able to reproduce the bug after a corosync/node restart? If so,
could you try libqb master? There was a bug in the qb_loop part of libqb
(https://github.com/ClusterLabs/libqb/pull/147), so it may be related.

Regards,
   Honza

>
> Best,
> Vladislav