[ClusterLabs] corosync eats the whole CPU core in epoll_wait() on one node in cluster

Vladislav Bogdanov bubble at hoster-ok.com
Mon Jun 1 10:09:05 UTC 2015


Hi Honza,

01.06.2015 10:21, Jan Friesse wrote:
> Vladislav
> 
> Vladislav Bogdanov napsal(a):
>> Hi,
>>
>> Just noticed subj on just one node in 4-node cluster.
>>
>> I've dumped blackbox logs, but unfortunately that didn't help me to
>> understand what's going on because even debug logs are too slender.
> 
> Do you still have them? Because maybe few lines from them may be helpful.

Yep, here is the tail for the last 5 min. Do you need the whole thing?

debug   Jun 01 09:34:01 message_handler_req_lib_quorum_getquorate(396):2: got quorate request on 0x7f9434f12390
debug   Jun 01 09:35:01 message_handler_req_lib_quorum_getquorate(396):2: got quorate request on 0x7f9434f12390
debug   Jun 01 09:36:01 message_handler_req_lib_quorum_getquorate(396):2: got quorate request on 0x7f9434f12390
debug   Jun 01 09:37:01 message_handler_req_lib_quorum_getquorate(396):2: got quorate request on 0x7f9434f12390
debug   Jun 01 09:37:54 handle_new_connection(486):4: IPC credentials authenticated (2177-26592-33)
debug   Jun 01 09:37:54 qb_ipcs_shm_connect(294):4: connecting to client [26592]
debug   Jun 01 09:37:54 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug   Jun 01 09:37:54 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug   Jun 01 09:37:54 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug   Jun 01 09:37:54 cs_ipcs_connection_created(272):1: connection created
debug   Jun 01 09:37:54 cpg_lib_init_fn(1637):7: lib_init_fn: conn=0x7f9434f16230, cpd=0x7f9434f169b4
debug   Jun 01 09:37:54 message_handler_req_lib_cpg_finalize(1760):7: cpg finalize for conn=0x7f9434f16230
debug   Jun 01 09:37:54 qb_ipcs_dispatch_connection_request(758):4: HUP conn (2177-26592-33)
debug   Jun 01 09:37:54 qb_ipcs_disconnect(605):4: qb_ipcs_disconnect(2177-26592-33) state:2
debug   Jun 01 09:37:54 _del(117):4: epoll_ctl(del): Bad file descriptor (9)
debug   Jun 01 09:37:54 cs_ipcs_connection_closed(417):1: cs_ipcs_connection_closed() 
debug   Jun 01 09:37:54 cpg_lib_exit_fn(1131):7: exit_fn for conn=0x7f9434f16230
debug   Jun 01 09:37:54 cs_ipcs_connection_destroyed(390):1: cs_ipcs_connection_destroyed() 
trace   Jun 01 09:37:54 qb_rb_close(288):4: ENTERING qb_rb_close()
trace   Jun 01 09:37:54 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug   Jun 01 09:37:54 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cpg-response-2177-26592-33-header
trace   Jun 01 09:37:54 qb_rb_close(288):4: ENTERING qb_rb_close()
trace   Jun 01 09:37:54 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug   Jun 01 09:37:54 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cpg-event-2177-26592-33-header
trace   Jun 01 09:37:54 qb_rb_close(288):4: ENTERING qb_rb_close()
trace   Jun 01 09:37:54 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug   Jun 01 09:37:54 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cpg-request-2177-26592-33-header
debug   Jun 01 09:37:54 handle_new_connection(486):4: IPC credentials authenticated (2177-26592-33)
debug   Jun 01 09:37:54 qb_ipcs_shm_connect(294):4: connecting to client [26592]
debug   Jun 01 09:37:54 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug   Jun 01 09:37:54 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug   Jun 01 09:37:54 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug   Jun 01 09:37:54 cs_ipcs_connection_created(272):1: connection created
debug   Jun 01 09:37:54 cmap_lib_init_fn(306):8: lib_init_fn: conn=0x7f9434f16230
debug   Jun 01 09:37:54 qb_ipcs_dispatch_connection_request(758):4: HUP conn (2177-26592-33)
debug   Jun 01 09:37:54 qb_ipcs_disconnect(605):4: qb_ipcs_disconnect(2177-26592-33) state:2
debug   Jun 01 09:37:54 _del(117):4: epoll_ctl(del): Bad file descriptor (9)
debug   Jun 01 09:37:54 cs_ipcs_connection_closed(417):1: cs_ipcs_connection_closed() 
debug   Jun 01 09:37:54 cmap_lib_exit_fn(325):8: exit_fn for conn=0x7f9434f16230
debug   Jun 01 09:37:54 cs_ipcs_connection_destroyed(390):1: cs_ipcs_connection_destroyed() 
trace   Jun 01 09:37:54 qb_rb_close(288):4: ENTERING qb_rb_close()
trace   Jun 01 09:37:54 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug   Jun 01 09:37:54 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cmap-response-2177-26592-33-header
trace   Jun 01 09:37:54 qb_rb_close(288):4: ENTERING qb_rb_close()
trace   Jun 01 09:37:54 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug   Jun 01 09:37:54 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cmap-event-2177-26592-33-header
trace   Jun 01 09:37:54 qb_rb_close(288):4: ENTERING qb_rb_close()
trace   Jun 01 09:37:54 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug   Jun 01 09:37:54 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cmap-request-2177-26592-33-header
debug   Jun 01 09:37:54 handle_new_connection(486):4: IPC credentials authenticated (2177-26594-33)
debug   Jun 01 09:37:54 qb_ipcs_shm_connect(294):4: connecting to client [26594]
debug   Jun 01 09:37:54 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug   Jun 01 09:37:54 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug   Jun 01 09:37:54 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug   Jun 01 09:37:54 cs_ipcs_connection_created(272):1: connection created
debug   Jun 01 09:37:54 cpg_lib_init_fn(1637):7: lib_init_fn: conn=0x7f9434f16230, cpd=0x7f9434f169b4
debug   Jun 01 09:37:54 message_handler_req_lib_cpg_finalize(1760):7: cpg finalize for conn=0x7f9434f16230
debug   Jun 01 09:37:54 qb_ipcs_dispatch_connection_request(758):4: HUP conn (2177-26594-33)
debug   Jun 01 09:37:54 qb_ipcs_disconnect(605):4: qb_ipcs_disconnect(2177-26594-33) state:2
debug   Jun 01 09:37:54 _del(117):4: epoll_ctl(del): Bad file descriptor (9)
debug   Jun 01 09:37:54 cs_ipcs_connection_closed(417):1: cs_ipcs_connection_closed() 
debug   Jun 01 09:37:54 cpg_lib_exit_fn(1131):7: exit_fn for conn=0x7f9434f16230
debug   Jun 01 09:37:54 cs_ipcs_connection_destroyed(390):1: cs_ipcs_connection_destroyed() 
trace   Jun 01 09:37:54 qb_rb_close(288):4: ENTERING qb_rb_close()
trace   Jun 01 09:37:54 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug   Jun 01 09:37:54 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cpg-response-2177-26594-33-header
trace   Jun 01 09:37:54 qb_rb_close(288):4: ENTERING qb_rb_close()
trace   Jun 01 09:37:54 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug   Jun 01 09:37:54 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cpg-event-2177-26594-33-header
trace   Jun 01 09:37:54 qb_rb_close(288):4: ENTERING qb_rb_close()
trace   Jun 01 09:37:54 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug   Jun 01 09:37:54 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cpg-request-2177-26594-33-header
debug   Jun 01 09:37:54 handle_new_connection(486):4: IPC credentials authenticated (2177-26594-33)
debug   Jun 01 09:37:54 qb_ipcs_shm_connect(294):4: connecting to client [26594]
debug   Jun 01 09:37:54 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug   Jun 01 09:37:54 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug   Jun 01 09:37:54 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug   Jun 01 09:37:54 cs_ipcs_connection_created(272):1: connection created
debug   Jun 01 09:37:54 cmap_lib_init_fn(306):8: lib_init_fn: conn=0x7f9434f16230
debug   Jun 01 09:37:54 qb_ipcs_dispatch_connection_request(758):4: HUP conn (2177-26594-33)
debug   Jun 01 09:37:54 qb_ipcs_disconnect(605):4: qb_ipcs_disconnect(2177-26594-33) state:2
debug   Jun 01 09:37:54 _del(117):4: epoll_ctl(del): Bad file descriptor (9)
debug   Jun 01 09:37:54 cs_ipcs_connection_closed(417):1: cs_ipcs_connection_closed() 
debug   Jun 01 09:37:54 cmap_lib_exit_fn(325):8: exit_fn for conn=0x7f9434f16230
debug   Jun 01 09:37:54 cs_ipcs_connection_destroyed(390):1: cs_ipcs_connection_destroyed() 
trace   Jun 01 09:37:54 qb_rb_close(288):4: ENTERING qb_rb_close()
trace   Jun 01 09:37:54 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug   Jun 01 09:37:54 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cmap-response-2177-26594-33-header
trace   Jun 01 09:37:54 qb_rb_close(288):4: ENTERING qb_rb_close()
trace   Jun 01 09:37:54 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug   Jun 01 09:37:54 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cmap-event-2177-26594-33-header
trace   Jun 01 09:37:54 qb_rb_close(288):4: ENTERING qb_rb_close()
trace   Jun 01 09:37:54 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug   Jun 01 09:37:54 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cmap-request-2177-26594-33-header
debug   Jun 01 09:38:01 message_handler_req_lib_quorum_getquorate(396):2: got quorate request on 0x7f9434f12390
debug   Jun 01 09:38:34 handle_new_connection(486):4: IPC credentials authenticated (2177-28531-33)
debug   Jun 01 09:38:34 qb_ipcs_shm_connect(294):4: connecting to client [28531]
debug   Jun 01 09:38:34 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug   Jun 01 09:38:34 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug   Jun 01 09:38:34 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug   Jun 01 09:38:34 cs_ipcs_connection_created(272):1: connection created
debug   Jun 01 09:38:34 cmap_lib_init_fn(306):8: lib_init_fn: conn=0x7f9434f16230
debug   Jun 01 09:38:34 qb_ipcs_dispatch_connection_request(758):4: HUP conn (2177-28531-33)
debug   Jun 01 09:38:34 qb_ipcs_disconnect(605):4: qb_ipcs_disconnect(2177-28531-33) state:2
debug   Jun 01 09:38:34 _del(117):4: epoll_ctl(del): Bad file descriptor (9)
debug   Jun 01 09:38:34 cs_ipcs_connection_closed(417):1: cs_ipcs_connection_closed() 
debug   Jun 01 09:38:34 cmap_lib_exit_fn(325):8: exit_fn for conn=0x7f9434f16230
debug   Jun 01 09:38:34 cs_ipcs_connection_destroyed(390):1: cs_ipcs_connection_destroyed() 
trace   Jun 01 09:38:34 qb_rb_close(288):4: ENTERING qb_rb_close()
trace   Jun 01 09:38:34 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug   Jun 01 09:38:34 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cmap-response-2177-28531-33-header
trace   Jun 01 09:38:34 qb_rb_close(288):4: ENTERING qb_rb_close()
trace   Jun 01 09:38:34 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug   Jun 01 09:38:34 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cmap-event-2177-28531-33-header
trace   Jun 01 09:38:34 qb_rb_close(288):4: ENTERING qb_rb_close()
trace   Jun 01 09:38:34 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug   Jun 01 09:38:34 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cmap-request-2177-28531-33-header
debug   Jun 01 09:38:34 handle_new_connection(486):4: IPC credentials authenticated (2177-28533-33)
debug   Jun 01 09:38:34 qb_ipcs_shm_connect(294):4: connecting to client [28533]
debug   Jun 01 09:38:34 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug   Jun 01 09:38:34 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug   Jun 01 09:38:34 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug   Jun 01 09:38:34 cs_ipcs_connection_created(272):1: connection created
debug   Jun 01 09:38:34 cmap_lib_init_fn(306):8: lib_init_fn: conn=0x7f9434f16230

There is

_del(117):4: epoll_ctl(del): Bad file descriptor (9)

but that is another descriptor and that messages are not so frequent.

> 
> 
>>
>> strace on a running process doesn't show anything except epoll_wait.
>> ...
>> epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107) 
>> = 1
>> epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107) 
>> = 1
>> epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107) 
>> = 1
>> epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107) 
>> = 1
>> ...
>>
>> But that ones are way to frequent:
>> # timeout 10 strace -p 2177 2>&1 | grep EPOLLIN >/tmp/corosync-epoll.log
>> Terminated
>> # wc -l /tmp/corosync-epoll.log
>> 438399 /tmp/corosync-epoll.log
>>
>> that means: ~43840 times per second.
>>
>> Other nodes show zero.
>> Pacemaker DC is on the another node.
>>
>> Nodes are completely identical.
>>
>> fd 19 which generates that events is shown in lsof this way:
>> corosync   2177      root   19u     unix 0xffff88062f896680          0t0
>>       17987 socket
>>
>> netstat for that inode (17987) shows:
>> unix  3      [ ]         STREAM     CONNECTED     17987  2177/corosync
>>       @cpg
>>
>> So that socket is used by CPG.
>>
>> nearest socket inode (connecting one, 17986) is used by pacemakerd.
>>
>> strace of pacemakerd shows absolutely normal
>> poll([{fd=8, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5,
>> events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}], 4, 500) = 0 (Timeout)
>> poll([{fd=8, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5,
>> events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}], 4, 500) = 0 (Timeout)
>> poll([{fd=8, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5,
>> events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}], 4, 500) = 0 (Timeout)
>>
> 
> AFAIK pacemaker doesn't use libqb qb_loop but corosync does. That's the 
> reason.
> 
>> So, this looks like a defect, but where?
>> libqb seems to be the main suspect, but I'm not sure.
>>
>> That is centos6, corosync 53f67a2 on top of libqb 0.17.1 (recompile of
>> David's 0.17.1-1 dated Tue Aug 26 2014).
>> Pacemaker is fbc239b.
> 
> Are you able to reproduce bug after corosync/node restart? If so, could 

No, it's fully sporadic, actually that is the first time I saw it.
I thought about trying gdb, but that will ultimately break the cluster even
with the immediate "continue".
ltrace does not show anything either.
May be somebody have another ideas on how to debug that?

> you try libqb master? There was bug in libqb 
> https://github.com/ClusterLabs/libqb/pull/147 in qb_loop part so maybe 
> related.

I had a look on it already, but was not 100% sure it is related.
Sadly. I do not have a procedure to reproduce, so I' d prefer to have it all
running in that state for some time in the hope someone can propose a good way to diagnose
or say for sure it is #147, and only upgrade after the resolution or timeout.

Best,
Vladislav





More information about the Users mailing list