[ClusterLabs] corosync eats the whole CPU core in epoll_wait() on one node in cluster

Vladislav Bogdanov bubble at hoster-ok.com
Fri May 29 08:25:15 EDT 2015


Hi,

I just noticed the problem from the subject line on only one node of a 4-node cluster.

I've dumped the blackbox logs, but unfortunately that didn't help me 
understand what's going on, because even the debug logs are too sparse.

strace on the running process shows nothing except epoll_wait calls:
...
epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107) = 1
epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107) = 1
epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107) = 1
epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107) = 1
...

But those calls are far too frequent:
# timeout 10 strace -p 2177 2>&1 | grep EPOLLIN >/tmp/corosync-epoll.log
Terminated
# wc -l /tmp/corosync-epoll.log
438399 /tmp/corosync-epoll.log

That means ~43840 calls per second.
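
The arithmetic behind that figure can be sketched in Python as follows. The function names are made up for illustration. The second figure is only a rough reference point: it assumes every call would simply run into the 107 ms timeout seen in the strace output, which need not hold for corosync in general.

```python
def calls_per_second(line_count: int, duration_s: float) -> float:
    """Average number of logged calls per second over the capture window."""
    return line_count / duration_s


def timeout_bound_rate(timeout_ms: float) -> float:
    """Calls per second if each call slept for the full timeout (assumption)."""
    return 1000.0 / timeout_ms


if __name__ == "__main__":
    # 438399 lines captured by `timeout 10 strace ... | grep EPOLLIN`
    observed = calls_per_second(438399, 10)
    # 107 is the timeout argument (milliseconds) of the epoll_wait calls above
    idle = timeout_bound_rate(107)
    print(f"observed: {observed:.1f} calls/s")
    print(f"timeout-bound reference: {idle:.1f} calls/s")
```

The gap between the two numbers is what makes the observed rate look like a busy loop rather than normal timer wakeups.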

Other nodes show zero.
The Pacemaker DC is on another node.

Nodes are completely identical.

fd 19, which generates those events, is shown by lsof as follows:
corosync   2177      root   19u     unix 0xffff88062f896680          0t0       17987 socket

netstat for that inode (17987) shows:
unix  3      [ ]         STREAM     CONNECTED     17987  2177/corosync       @cpg

So that socket is used by CPG.

The neighbouring socket inode (the connecting one, 17986) is used by pacemakerd.
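
The fd-to-inode step can also be done without lsof by reading the `/proc/<pid>/fd/<fd>` symlink, which on Linux points to `socket:[<inode>]` for sockets. The following is a minimal Python sketch of that lookup; the helper name `socket_inode` is hypothetical, and the demo inspects a socket in its own process rather than corosync's fd 19.

```python
import os
import re
import socket
from typing import Optional


def socket_inode(pid: int, fd: int) -> Optional[int]:
    """Return the socket inode behind /proc/<pid>/fd/<fd>, or None if not a socket."""
    target = os.readlink(f"/proc/{pid}/fd/{fd}")
    match = re.fullmatch(r"socket:\[(\d+)\]", target)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Demo on our own process: open a unix stream socket and look up its inode.
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        print("fd", sock.fileno(), "-> inode", socket_inode(os.getpid(), sock.fileno()))
    finally:
        sock.close()
```

For the case in this post, the equivalent call would be `socket_inode(2177, 19)`, run as root on the affected node.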

strace of pacemakerd shows perfectly normal behavior:
poll([{fd=8, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}], 4, 500) = 0 (Timeout)
poll([{fd=8, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}], 4, 500) = 0 (Timeout)
poll([{fd=8, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}], 4, 500) = 0 (Timeout)

So, this looks like a defect, but where?
libqb seems to be the main suspect, but I'm not sure.
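
To make the symptom concrete: epoll is level-triggered by default, so an fd that stays readable and is never drained makes every epoll_wait call return immediately with the same EPOLLIN event. That would match a strace showing only epoll_wait and no reads. Whether this is what actually happens inside corosync or libqb here is an assumption; nothing in the logs confirms it. A minimal Python sketch of the mechanism, using a unix socket pair as a stand-in:

```python
import select
import socket


def count_wakeups(iterations: int = 1000) -> int:
    """Count how often epoll reports a readable socket that is never read."""
    reader, writer = socket.socketpair()  # connected AF_UNIX stream sockets
    ep = select.epoll()
    try:
        # No EPOLLET flag, so the registration is level-triggered.
        ep.register(reader.fileno(), select.EPOLLIN)
        writer.send(b"x")  # leave one byte pending and never read it
        wakeups = 0
        for _ in range(iterations):
            # Timeout is in seconds; 0.107 mirrors the 107 ms seen in strace.
            if ep.poll(0.107, 12):
                wakeups += 1
        return wakeups
    finally:
        ep.close()
        reader.close()
        writer.close()


if __name__ == "__main__":
    print("wakeups:", count_wakeups())
```

Every iteration reports the event without waiting for the timeout, which is the same shape as the loop observed on fd 19.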

This is CentOS 6, with corosync 53f67a2 on top of libqb 0.17.1 (a recompile 
of David's 0.17.1-1 dated Tue Aug 26 2014).
Pacemaker is fbc239b.

Best,
Vladislav





