[ClusterLabs] corosync eats the whole CPU core in epoll_wait() on one node in cluster
Vladislav Bogdanov
bubble at hoster-ok.com
Mon Jun 1 10:09:05 UTC 2015
Hi Honza,
01.06.2015 10:21, Jan Friesse wrote:
> Vladislav
>
> Vladislav Bogdanov napsal(a):
>> Hi,
>>
>> Just noticed subj on just one node in 4-node cluster.
>>
>> I've dumped blackbox logs, but unfortunately that didn't help me to
>> understand what's going on because even debug logs are too slender.
>
> Do you still have them? Because maybe few lines from them may be helpful.
Yep, here is the tail for the last 5 min. Do you need the whole thing?
debug Jun 01 09:34:01 message_handler_req_lib_quorum_getquorate(396):2: got quorate request on 0x7f9434f12390
debug Jun 01 09:35:01 message_handler_req_lib_quorum_getquorate(396):2: got quorate request on 0x7f9434f12390
debug Jun 01 09:36:01 message_handler_req_lib_quorum_getquorate(396):2: got quorate request on 0x7f9434f12390
debug Jun 01 09:37:01 message_handler_req_lib_quorum_getquorate(396):2: got quorate request on 0x7f9434f12390
debug Jun 01 09:37:54 handle_new_connection(486):4: IPC credentials authenticated (2177-26592-33)
debug Jun 01 09:37:54 qb_ipcs_shm_connect(294):4: connecting to client [26592]
debug Jun 01 09:37:54 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug Jun 01 09:37:54 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug Jun 01 09:37:54 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug Jun 01 09:37:54 cs_ipcs_connection_created(272):1: connection created
debug Jun 01 09:37:54 cpg_lib_init_fn(1637):7: lib_init_fn: conn=0x7f9434f16230, cpd=0x7f9434f169b4
debug Jun 01 09:37:54 message_handler_req_lib_cpg_finalize(1760):7: cpg finalize for conn=0x7f9434f16230
debug Jun 01 09:37:54 qb_ipcs_dispatch_connection_request(758):4: HUP conn (2177-26592-33)
debug Jun 01 09:37:54 qb_ipcs_disconnect(605):4: qb_ipcs_disconnect(2177-26592-33) state:2
debug Jun 01 09:37:54 _del(117):4: epoll_ctl(del): Bad file descriptor (9)
debug Jun 01 09:37:54 cs_ipcs_connection_closed(417):1: cs_ipcs_connection_closed()
debug Jun 01 09:37:54 cpg_lib_exit_fn(1131):7: exit_fn for conn=0x7f9434f16230
debug Jun 01 09:37:54 cs_ipcs_connection_destroyed(390):1: cs_ipcs_connection_destroyed()
trace Jun 01 09:37:54 qb_rb_close(288):4: ENTERING qb_rb_close()
trace Jun 01 09:37:54 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug Jun 01 09:37:54 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cpg-response-2177-26592-33-header
trace Jun 01 09:37:54 qb_rb_close(288):4: ENTERING qb_rb_close()
trace Jun 01 09:37:54 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug Jun 01 09:37:54 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cpg-event-2177-26592-33-header
trace Jun 01 09:37:54 qb_rb_close(288):4: ENTERING qb_rb_close()
trace Jun 01 09:37:54 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug Jun 01 09:37:54 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cpg-request-2177-26592-33-header
debug Jun 01 09:37:54 handle_new_connection(486):4: IPC credentials authenticated (2177-26592-33)
debug Jun 01 09:37:54 qb_ipcs_shm_connect(294):4: connecting to client [26592]
debug Jun 01 09:37:54 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug Jun 01 09:37:54 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug Jun 01 09:37:54 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug Jun 01 09:37:54 cs_ipcs_connection_created(272):1: connection created
debug Jun 01 09:37:54 cmap_lib_init_fn(306):8: lib_init_fn: conn=0x7f9434f16230
debug Jun 01 09:37:54 qb_ipcs_dispatch_connection_request(758):4: HUP conn (2177-26592-33)
debug Jun 01 09:37:54 qb_ipcs_disconnect(605):4: qb_ipcs_disconnect(2177-26592-33) state:2
debug Jun 01 09:37:54 _del(117):4: epoll_ctl(del): Bad file descriptor (9)
debug Jun 01 09:37:54 cs_ipcs_connection_closed(417):1: cs_ipcs_connection_closed()
debug Jun 01 09:37:54 cmap_lib_exit_fn(325):8: exit_fn for conn=0x7f9434f16230
debug Jun 01 09:37:54 cs_ipcs_connection_destroyed(390):1: cs_ipcs_connection_destroyed()
trace Jun 01 09:37:54 qb_rb_close(288):4: ENTERING qb_rb_close()
trace Jun 01 09:37:54 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug Jun 01 09:37:54 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cmap-response-2177-26592-33-header
trace Jun 01 09:37:54 qb_rb_close(288):4: ENTERING qb_rb_close()
trace Jun 01 09:37:54 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug Jun 01 09:37:54 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cmap-event-2177-26592-33-header
trace Jun 01 09:37:54 qb_rb_close(288):4: ENTERING qb_rb_close()
trace Jun 01 09:37:54 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug Jun 01 09:37:54 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cmap-request-2177-26592-33-header
debug Jun 01 09:37:54 handle_new_connection(486):4: IPC credentials authenticated (2177-26594-33)
debug Jun 01 09:37:54 qb_ipcs_shm_connect(294):4: connecting to client [26594]
debug Jun 01 09:37:54 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug Jun 01 09:37:54 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug Jun 01 09:37:54 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug Jun 01 09:37:54 cs_ipcs_connection_created(272):1: connection created
debug Jun 01 09:37:54 cpg_lib_init_fn(1637):7: lib_init_fn: conn=0x7f9434f16230, cpd=0x7f9434f169b4
debug Jun 01 09:37:54 message_handler_req_lib_cpg_finalize(1760):7: cpg finalize for conn=0x7f9434f16230
debug Jun 01 09:37:54 qb_ipcs_dispatch_connection_request(758):4: HUP conn (2177-26594-33)
debug Jun 01 09:37:54 qb_ipcs_disconnect(605):4: qb_ipcs_disconnect(2177-26594-33) state:2
debug Jun 01 09:37:54 _del(117):4: epoll_ctl(del): Bad file descriptor (9)
debug Jun 01 09:37:54 cs_ipcs_connection_closed(417):1: cs_ipcs_connection_closed()
debug Jun 01 09:37:54 cpg_lib_exit_fn(1131):7: exit_fn for conn=0x7f9434f16230
debug Jun 01 09:37:54 cs_ipcs_connection_destroyed(390):1: cs_ipcs_connection_destroyed()
trace Jun 01 09:37:54 qb_rb_close(288):4: ENTERING qb_rb_close()
trace Jun 01 09:37:54 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug Jun 01 09:37:54 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cpg-response-2177-26594-33-header
trace Jun 01 09:37:54 qb_rb_close(288):4: ENTERING qb_rb_close()
trace Jun 01 09:37:54 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug Jun 01 09:37:54 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cpg-event-2177-26594-33-header
trace Jun 01 09:37:54 qb_rb_close(288):4: ENTERING qb_rb_close()
trace Jun 01 09:37:54 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug Jun 01 09:37:54 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cpg-request-2177-26594-33-header
debug Jun 01 09:37:54 handle_new_connection(486):4: IPC credentials authenticated (2177-26594-33)
debug Jun 01 09:37:54 qb_ipcs_shm_connect(294):4: connecting to client [26594]
debug Jun 01 09:37:54 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug Jun 01 09:37:54 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug Jun 01 09:37:54 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug Jun 01 09:37:54 cs_ipcs_connection_created(272):1: connection created
debug Jun 01 09:37:54 cmap_lib_init_fn(306):8: lib_init_fn: conn=0x7f9434f16230
debug Jun 01 09:37:54 qb_ipcs_dispatch_connection_request(758):4: HUP conn (2177-26594-33)
debug Jun 01 09:37:54 qb_ipcs_disconnect(605):4: qb_ipcs_disconnect(2177-26594-33) state:2
debug Jun 01 09:37:54 _del(117):4: epoll_ctl(del): Bad file descriptor (9)
debug Jun 01 09:37:54 cs_ipcs_connection_closed(417):1: cs_ipcs_connection_closed()
debug Jun 01 09:37:54 cmap_lib_exit_fn(325):8: exit_fn for conn=0x7f9434f16230
debug Jun 01 09:37:54 cs_ipcs_connection_destroyed(390):1: cs_ipcs_connection_destroyed()
trace Jun 01 09:37:54 qb_rb_close(288):4: ENTERING qb_rb_close()
trace Jun 01 09:37:54 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug Jun 01 09:37:54 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cmap-response-2177-26594-33-header
trace Jun 01 09:37:54 qb_rb_close(288):4: ENTERING qb_rb_close()
trace Jun 01 09:37:54 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug Jun 01 09:37:54 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cmap-event-2177-26594-33-header
trace Jun 01 09:37:54 qb_rb_close(288):4: ENTERING qb_rb_close()
trace Jun 01 09:37:54 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug Jun 01 09:37:54 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cmap-request-2177-26594-33-header
debug Jun 01 09:38:01 message_handler_req_lib_quorum_getquorate(396):2: got quorate request on 0x7f9434f12390
debug Jun 01 09:38:34 handle_new_connection(486):4: IPC credentials authenticated (2177-28531-33)
debug Jun 01 09:38:34 qb_ipcs_shm_connect(294):4: connecting to client [28531]
debug Jun 01 09:38:34 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug Jun 01 09:38:34 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug Jun 01 09:38:34 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug Jun 01 09:38:34 cs_ipcs_connection_created(272):1: connection created
debug Jun 01 09:38:34 cmap_lib_init_fn(306):8: lib_init_fn: conn=0x7f9434f16230
debug Jun 01 09:38:34 qb_ipcs_dispatch_connection_request(758):4: HUP conn (2177-28531-33)
debug Jun 01 09:38:34 qb_ipcs_disconnect(605):4: qb_ipcs_disconnect(2177-28531-33) state:2
debug Jun 01 09:38:34 _del(117):4: epoll_ctl(del): Bad file descriptor (9)
debug Jun 01 09:38:34 cs_ipcs_connection_closed(417):1: cs_ipcs_connection_closed()
debug Jun 01 09:38:34 cmap_lib_exit_fn(325):8: exit_fn for conn=0x7f9434f16230
debug Jun 01 09:38:34 cs_ipcs_connection_destroyed(390):1: cs_ipcs_connection_destroyed()
trace Jun 01 09:38:34 qb_rb_close(288):4: ENTERING qb_rb_close()
trace Jun 01 09:38:34 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug Jun 01 09:38:34 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cmap-response-2177-28531-33-header
trace Jun 01 09:38:34 qb_rb_close(288):4: ENTERING qb_rb_close()
trace Jun 01 09:38:34 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug Jun 01 09:38:34 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cmap-event-2177-28531-33-header
trace Jun 01 09:38:34 qb_rb_close(288):4: ENTERING qb_rb_close()
trace Jun 01 09:38:34 my_posix_sem_destroy(91):4: ENTERING my_posix_sem_destroy()
debug Jun 01 09:38:34 qb_rb_close(299):4: Free'ing ringbuffer: /dev/shm/qb-cmap-request-2177-28531-33-header
debug Jun 01 09:38:34 handle_new_connection(486):4: IPC credentials authenticated (2177-28533-33)
debug Jun 01 09:38:34 qb_ipcs_shm_connect(294):4: connecting to client [28533]
debug Jun 01 09:38:34 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug Jun 01 09:38:34 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug Jun 01 09:38:34 qb_rb_open_2(236):4: shm size:1048589; real_size:1052672; rb->word_size:263168
debug Jun 01 09:38:34 cs_ipcs_connection_created(272):1: connection created
debug Jun 01 09:38:34 cmap_lib_init_fn(306):8: lib_init_fn: conn=0x7f9434f16230
There is
_del(117):4: epoll_ctl(del): Bad file descriptor (9)
but that is another descriptor and that messages are not so frequent.
>
>
>>
>> strace on a running process doesn't show anything except epoll_wait.
>> ...
>> epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107)
>> = 1
>> epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107)
>> = 1
>> epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107)
>> = 1
>> epoll_wait(4, {{EPOLLIN, {u32=19, u64=3703511490016313363}}}, 12, 107)
>> = 1
>> ...
>>
>> But that ones are way to frequent:
>> # timeout 10 strace -p 2177 2>&1 | grep EPOLLIN >/tmp/corosync-epoll.log
>> Terminated
>> # wc -l /tmp/corosync-epoll.log
>> 438399 /tmp/corosync-epoll.log
>>
>> that means: ~43840 times per second.
>>
>> Other nodes show zero.
>> Pacemaker DC is on the another node.
>>
>> Nodes are completely identical.
>>
>> fd 19 which generates that events is shown in lsof this way:
>> corosync 2177 root 19u unix 0xffff88062f896680 0t0
>> 17987 socket
>>
>> netstat for that inode (17987) shows:
>> unix 3 [ ] STREAM CONNECTED 17987 2177/corosync
>> @cpg
>>
>> So that socket is used by CPG.
>>
>> nearest socket inode (connecting one, 17986) is used by pacemakerd.
>>
>> strace of pacemakerd shows absolutely normal
>> poll([{fd=8, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5,
>> events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}], 4, 500) = 0 (Timeout)
>> poll([{fd=8, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5,
>> events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}], 4, 500) = 0 (Timeout)
>> poll([{fd=8, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5,
>> events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}], 4, 500) = 0 (Timeout)
>>
>
> AFAIK pacemaker doesn't use libqb qb_loop but corosync does. That's the
> reason.
>
>> So, this looks like a defect, but where?
>> libqb seems to be the main suspect, but I'm not sure.
>>
>> That is centos6, corosync 53f67a2 on top of libqb 0.17.1 (recompile of
>> David's 0.17.1-1 dated Tue Aug 26 2014).
>> Pacemaker is fbc239b.
>
> Are you able to reproduce bug after corosync/node restart? If so, could
No, it's fully sporadic, actually that is the first time I saw it.
I thought about trying gdb, but that will ultimately break the cluster even
with the immediate "continue".
ltrace does not show anything either.
May be somebody have another ideas on how to debug that?
> you try libqb master? There was bug in libqb
> https://github.com/ClusterLabs/libqb/pull/147 in qb_loop part so maybe
> related.
I had a look on it already, but was not 100% sure it is related.
Sadly. I do not have a procedure to reproduce, so I' d prefer to have it all
running in that state for some time in the hope someone can propose a good way to diagnose
or say for sure it is #147, and only upgrade after the resolution or timeout.
Best,
Vladislav
More information about the Users
mailing list