[ClusterLabs] corosync eats the whole CPU core in epoll_wait() on one node in cluster

Mon Jun 1 09:00:58 UTC 2015

> On 1 Jun 2015, at 5:21 pm, Jan Friesse <jfriesse at redhat.com> wrote:
> 
> Vladislav
> 
> Vladislav Bogdanov wrote:
>> Hi,
>> 
>> I just noticed the problem from the subject line on only one node in a
>> 4-node cluster.
>> 
>> I've dumped the blackbox logs, but unfortunately that didn't help me
>> understand what's going on, because even the debug logs are too terse.
> 
> Do you still have them? A few lines from them may be helpful.
> 
> 
>> 
>> strace on a running process doesn't show anything except epoll_wait.
>> ...
>> epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107) = 1
>> epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107) = 1
>> epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107) = 1
>> epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107) = 1
>> ...
>> 
>> But those calls are far too frequent:
>> # timeout 10 strace -p 2177 2>&1 | grep EPOLLIN >/tmp/corosync-epoll.log
>> Terminated
>> # wc -l /tmp/corosync-epoll.log
>> 438399 /tmp/corosync-epoll.log
>> 
>> That means ~43840 calls per second.
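
The rate quoted above is the captured line count divided by the capture window. A minimal sketch of that arithmetic follows; the helper name rate_per_second is hypothetical, and the two arguments are the figures from the report. Since strace itself slows the traced process, the result is probably best read as a rough lower bound rather than an exact figure.

```shell
#!/bin/sh
# Hypothetical helper: events per second from a line count and a window length.
# $1 = number of matching strace lines, $2 = capture window in seconds
rate_per_second() {
    awk -v n="$1" -v s="$2" 'BEGIN { printf "%.0f\n", n / s }'
}

# Figures from the report: 438399 lines captured in a 10 second window.
rate_per_second 438399 10
```

With the numbers from the report this rounds 43839.9 to 43840, matching the estimate.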
>> 
>> Other nodes show zero.
>> The Pacemaker DC is on another node.
>> 
>> Nodes are completely identical.
>> 
>> fd 19, which generates those events, is shown in lsof this way:
>> corosync   2177      root   19u     unix 0xffff88062f896680          0t0      17987 socket
>> 
>> netstat for that inode (17987) shows:
>> unix  3      [ ]         STREAM     CONNECTED     17987  2177/corosync      @cpg
>> 
>> So that socket is used by CPG.
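
The same fd-to-socket lookup can also be done without lsof or netstat. The sketch below assumes a Linux /proc filesystem; the helper names socket_inode and unix_socket_rows are hypothetical, and the PID and fd in the usage comment are simply the values from the report.

```shell
#!/bin/sh
# Hypothetical helper: extract the inode from a /proc/<pid>/fd/<n> symlink
# target of the form "socket:[17987]". Prints nothing for non-socket targets.
socket_inode() {
    printf '%s\n' "$1" | sed -n 's/^socket:\[\([0-9][0-9]*\)\]$/\1/p'
}

# Hypothetical helper: print the rows of a /proc/net/unix style table whose
# inode column (field 7) matches $1. $2 optionally overrides the table path.
unix_socket_rows() {
    awk -v ino="$1" '$7 == ino' "${2:-/proc/net/unix}"
}

# Usage on the affected node (run as root; adjust PID and fd as needed):
#   ino=$(socket_inode "$(readlink /proc/2177/fd/19)")
#   unix_socket_rows "$ino"
```

The last column of a matching row is the socket path, where one is bound. Abstract sockets such as the CPG one are shown there with a leading @.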
>> 
>> The neighbouring socket inode (the connecting end, 17986) is used by pacemakerd.
>> 
>> strace of pacemakerd looks completely normal:
>> poll([{fd=8, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}], 4, 500) = 0 (Timeout)
>> poll([{fd=8, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}], 4, 500) = 0 (Timeout)
>> poll([{fd=8, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}], 4, 500) = 0 (Timeout)
>> 
> 
> AFAIK pacemaker doesn't use libqb qb_loop but corosync does. That's the reason.

The reason for what?

> 
>> So, this looks like a defect, but where?
>> libqb seems to be the main suspect, but I'm not sure.
>> 
>> This is CentOS 6, with corosync 53f67a2 on top of libqb 0.17.1 (a recompile
>> of David's 0.17.1-1 dated Tue Aug 26 2014).
>> Pacemaker is fbc239b.
> 
> Are you able to reproduce the bug after a corosync/node restart? If so, could you try libqb master? There was a bug in the qb_loop part of libqb (https://github.com/ClusterLabs/libqb/pull/147), so it may be related.
> 
> Regards,
>  Honza
> 
> 
>> 
>> Best,
>> Vladislav
>> 
>> 
>> 
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> http://clusterlabs.org/mailman/listinfo/users
>> 
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org