[ClusterLabs] corosync dead loop in segfault handler

Wed Feb 15 16:09:11 EST 2017

On 15/02/17 18:04 +0100, Jan Pokorný wrote:
> On 15/02/17 15:13 +0000, Christine Caulfield wrote:
>> On 15/02/17 14:50, Jan Friesse wrote:
>>>> Hi all,
>>>> 
>>>> Corosync Cluster Engine, version '2.3.4'
>>>> Copyright (c) 2006-2009 Red Hat, Inc.
>>>> 
>>>> Today I found corosync consuming 100% CPU. strace showed the following:
>>>> 
>>>> write(7, "\v\0\0\0", 4)                 = -1 EAGAIN (Resource temporarily unavailable)
>>>> write(7, "\v\0\0\0", 4)                 = -1 EAGAIN (Resource temporarily unavailable)
>>>> 
>>>> Then I used gcore to get the coredump.
>>>> 
>>>> (gdb) bt
>>>> #0  0x00007f038b74b1cd in write () from /lib64/libpthread.so.0
>>>> #1  0x00007f038b9656ed in _handle_real_signal_ (signal_num=<optimized
>>>> out>, si=<optimized out>, context=<optimized out>) at loop_poll.c:474
>>>> #2  <signal handler called>
>>>> #3  0x0000000000000000 in ?? ()
>>>> #4  0x00007f038c220a3d in schedwrk_processor (context=<optimized out>)
>>>> at sync.c:551
>>>> #5  0x00007f038c23042b in schedwrk_do (type=<optimized out>,
>>>> context=0x6a12d56300000001) at schedwrk.c:77
>>>> #6  0x00007f038bdd49f7 in token_callbacks_execute
>>>> (type=TOTEM_CALLBACK_TOKEN_SENT, instance=<optimized out>) at
>>>> totemsrp.c:3493
>>>> #7  message_handler_orf_token (instance=<optimized out>,
>>>> msg=<optimized out>, endian_conversion_needed=<optimized out>,
>>>> msg_len=<optimized out>) at totemsrp.c:3894
>>>> #8  0x00007f038bdd65a5 in message_handler_orf_token
>>>> (instance=<optimized out>, msg=<optimized out>, msg_len=<optimized
>>>> out>, endian_conversion_needed=<optimized out>) at totemsrp.c:3609
>>>> #9  0x00007f038bdcdfb9 in rrp_deliver_fn (context=0x7f038d541840,
>>>> msg=0x7f038d541af8, msg_len=70) at totemrrp.c:1941
>>>> #10 0x00007f038bdca01e in net_deliver_fn (fd=<optimized out>,
>>>> revents=<optimized out>, data=0x7f038d541a90) at totemudpu.c:499
>>>> #11 0x00007f038b96576f in _poll_dispatch_and_take_back_
>>>> (item=0x7f038d4fe168, p=<optimized out>) at loop_poll.c:108
>>>> #12 0x00007f038b965300 in qb_loop_run_level (level=0x7f038d4fde08) at
>>>> loop.c:43
>>>> #13 qb_loop_run (lp=<optimized out>) at loop.c:210
>>>> #14 0x00007f038c21b6d0 in main (argc=<optimized out>, argv=<optimized
>>>> out>, envp=<optimized out>) at main.c:1383
>>>> 
>>>> (gdb) f 1
>>>> #1  0x00007f038b9656ed in _handle_real_signal_ (signal_num=<optimized
>>>> out>, si=<optimized out>, context=<optimized out>) at loop_poll.c:474
>>>> 474                     res = write(pipe_fds[1], &sig, sizeof(int32_t));
>>>> (gdb) info locals
>>>> sig = 11
>>>> res = <optimized out>
>>>> __func__ = "_handle_real_signal_"
>>>> (gdb) f 4
>>>> #4  0x00007f038c220a3d in schedwrk_processor (context=<optimized out>)
>>>> at sync.c:551
>>>> 551                             my_service_list[my_processing_idx].sync_init (my_trans_list,
>>>> (gdb) p my_processing_idx
>>>> $31 = 3
>>>> (gdb) p my_service_list[3]
>>>> $32 = {service_id = 0, sync_init = 0x0, sync_abort = 0x0, sync_process
>>>> = 0x0, sync_activate = 0x0, state = PROCESS, name = '\000 <repeats 127
>>>> times>}
>>>> 
>>>> So it seems corosync is dead-looping in the segfault handler.
>>>> I have not found any related changelog entry in the release notes after 2.3.4.
>>>> 
>>>> Can anyone help please?
>>> 
>>> Yep. It looks like (for some reason) the signal pipe was not processed and
>>> libqb's _handle_real_signal_ is looping. Corosync really cannot do
>>> anything about it. It looks like a regular libqb bug, so you can't do
>>> anything about it either. CCing Chrissie so she is aware.
>>> 
>> 
>> Yes, it seems that some corosync SEGVs trigger this obscure bug in
>> libqb. I've chased a few possible causes and none have been fruitful.
>> 
>> If you get this then corosync has crashed, and this other bug is masking
>> the actual diagnostics - I know, helpful :/
> 
> This particularly resembles a recent discovery in corosync -- the segfault
> handler does not expect a nested segfault, which leads to a tight loop
> on signal processing that, due to its priority, eats up the CPU:
> https://github.com/corosync/corosync/issues/159
> 
> A blueprint of a possible solution on the libqb side:
> https://github.com/ClusterLabs/libqb/pull/245
> 
> We could do better if we knew which signal in particular is the
> culprit in this case -- was it indeed SIGSEGV (I don't actually
> think so but it's hard to say)?

Ah, I missed "sig = 11" above, so indeed SIGSEGV.
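
As a cross-check, the strace payload says the same thing: "\v" is byte
0x0b, so the four bytes "\v\0\0\0" read back as an int32_t give 11. Below
is a minimal sketch of that decoding. It assumes a little-endian host and
a platform where SIGSEGV is 11 (as on Linux x86_64); the helper names are
made up for illustration and are not part of libqb.

```c
#include <signal.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the first four bytes of a buffer as a host-endian int32_t.
 * memcpy avoids alignment and aliasing problems with a pointer cast. */
static int32_t payload_to_signal(const char *bytes)
{
    int32_t sig;
    memcpy(&sig, bytes, sizeof(sig));
    return sig;
}

/* Hypothetical helper: does the written payload carry SIGSEGV? */
static int is_segv_payload(const char *bytes)
{
    return payload_to_signal(bytes) == SIGSEGV;
}
```

Feeding it the payload from the strace output above, "\v\0\0\0", matches
the "sig = 11" that gdb printed in frame 1.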

Anyway, there is a chance that libqb v1.0.1 (containing this PR:
https://github.com/ClusterLabs/libqb/pull/230) alleviates the issue.
I am still missing some parts of the picture.
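
To make the suspected mechanism concrete, here is a minimal sketch of a
handler that reports signals through a non-blocking pipe. It is not
libqb's actual code, and all function names are hypothetical. My
assumption is that the pipe is never drained because the main loop never
regains control, so once the pipe is full every write() fails with EAGAIN,
and retrying without a bound spins forever, as in the strace output.
The bounded variant is only an illustration of one way to give up, not
what either PR does.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Put a descriptor into non-blocking mode; 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Keep writing the signal number until the pipe refuses it.
 * Returns the number of successful writes; with nobody reading,
 * errno is EAGAIN when the loop ends because the pipe is full. */
static long fill_until_eagain(int wfd, int32_t sig)
{
    long n = 0;
    while (write(wfd, &sig, sizeof(sig)) == (ssize_t)sizeof(sig))
        n++;
    return n;
}

/* Hypothetical bounded notifier: retry on EAGAIN/EINTR at most
 * max_tries times instead of looping forever.
 * Returns 0 on success, -1 with errno set on failure. */
static int notify_bounded(int wfd, int32_t sig, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        ssize_t res = write(wfd, &sig, sizeof(sig));
        if (res == (ssize_t)sizeof(sig))
            return 0;
        if (res == -1 && errno != EAGAIN && errno != EINTR)
            return -1;      /* a real error, do not retry */
    }
    errno = EAGAIN;         /* gave up: the pipe stayed full */
    return -1;
}
```

With a full pipe and no reader, notify_bounded() returns -1 after
max_tries attempts, whereas an unbounded "goto try_again" on EAGAIN
would keep the CPU at 100%.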

-- 
Jan (Poki)