[ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

Thu Feb 14 19:12:09 UTC 2019

Edvin,
> Apologies for top posting, the strace you asked for is available here (although running strace itself had side-effect of getting corosync out of the live lock):

Yep, I know, but this was after strace finished, right? So links should 
contain time when corosync was stucked, right?

> https://clbin.com/9kOUM

I will take a look deeper into this tommorow, but after brief read it 
looks fine (with exception of

epoll_wait(4, {{EPOLLIN, {u32=15, u64=2526859986156388367}}}, 12, 25) = 1

block).

Regards,
   Honza

> 
> Best regards,
> --Edwin
> ________________________________________
> From: Jan Friesse <jfriesse at redhat.com>
> Sent: 14 February 2019 18:34
> To: Cluster Labs - All topics related to open-source clustering welcomed; Edvin Torok
> Cc: Mark Syms
> Subject: Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?
> 
> Edwin,
> 
>> Hello,
>>
>> We were testing corosync 2.4.3/libqb 1.0.1-6/sbd 1.3.1/gfs2 on 4.19 and
>> noticed a fundamental problem with realtime priorities:
>> - corosync runs on CPU3, and interrupts for the NIC used by corosync are
>> also routed to CPU3
>> - corosync runs with SCHED_RR, ksoftirqd does not (should it?), but
>> without it packets sent/received from that interface would not get processed
>> - corosync is in a busy loop using 100% CPU, never giving a chance for
>> softirqs to be processed (including TIMER and SCHED)
>>
>> This appears to be a priority inversion problem, if corosync runs as
>> realtime then everything it needs (timers...) should be realtime as
>> well, otherwise running as realtime guarantees we'll miss the watchdog
>> deadline, instead of guaranteeing that we process the data before the
>> deadline.
>>
>> Do you have some advice on what the expected realtime priorities would
>> be for:
>> - corosync
>> - sbd
>> - hard irqs
>> - soft irqs
>>
>> Also would it be possible for corosync to avoid hogging the CPU in libqb?
>> (Our hypothesis is that if softirqs are not processed then timers
>> wouldn't work for processes on that CPU either)
>>
>> Some more background and analysis:
>>
>> We noticed that cluster membership changes were very unstable
>> (especially as you approach 16 physical host clusters) causing the whole
>> cluster to fence when adding a single node. This was previously working
>> fairly reliably with a 4.4 based kernel.
>>
>> I've increased SBD timeout to 3600 to be able to investigate the problem
>> and noticed that corosync was using 100% of CPU3 [1] and I immediately
>> lost SSH access on eth0 (corosync was using eth1), where eth0's
>> interrupts were also processed on CPU3 (and irqbalance didn't move it).
>>
>> IIUC SCHED_RR tasks should not be able to take up 100% of CPU, according
>> to [3] it shouldn't be allowed to use more than 95% of CPU.
>>
>> Softirqs were not processed at all on CPU3 (see [2], the numbers in the
>> CPU3 column did not change, the numbers in the other columns did).
>> Tried decreasing priority of corosync using chrt to 1, which didn't
>> help. I then increased the priority of ksoftirqd to 50 using chrt, which
>> immediately solved the CPU usage problem on corosync.
>>
>> I tried a simple infinite loop program with realtime priority, but it
>> didn't reproduce the problems with interrupts getting stuck.
>>
>>
>> Three problems here:
>> * all softirqs were stuck (not being processed) on CPU3, which included
>> TIMER and SCHED. corosync relies quite heavily on timers, would lack of
>> processing them cause the 100% CPU usage?
>> * is there a kernel bug introduced between 4.4 - 4.19 that causes
>> realtime tasks to not respect the 95% limit anymore? This would leave 5%
>> time for IRQs, including NIC IRQs
>> *  if corosync runs at higher priority than the kernel softirq thread
>> processing NIC IRQ how is corosync expecting incoming packets to be
>> processed, if it is hogging the CPU by receiving, polling and sending
>> packets?
>>
>> On another host which exhibited the same problem I've run strace (which
>> also had the side-effect of getting corosync unstuck from 100% CPU use
>> after strace finished):
>> 1 bind
>> 5 close
>> 688 epoll_wait
>> 8 futex
>> 1 getsockname
>> 3 ioctl
>> 1 open
>> 3 recvfrom
>> 190 recvmsg
>> 87 sendmsg
>> 9 sendto
>> 4 socket
>> 6 write
>>
> 
> don't you have strace by any chance? Corosync itself shouldn't really
> stuck for such long especially if softirq was stuck so it couldn't get
> any packets, right? So it may be (take it as a shot in the dark) that
> some error code is not processed and instead there is some kind of never
> ending loop.
> 
> Honza
> 
>> On yet another host I've run gdb while corosync was stuck:
>> Thread 2 (Thread 0x7f6fd0c9b700 (LWP 16245)):
>> #0 0x00007f6fd34a0afb in do_futex_wait.constprop.1 ()
>> from /lib64/libpthread.so.0
>> #1 0x00007f6fd34a0b8f in __new_sem_wait_slow.constprop.0 ()
>> from /lib64/libpthread.so.0
>> #2 0x00007f6fd34a0c2b in sem_wait@@GLIBC_2.2.5 () from
>> /lib64/libpthread.so.0
>> #3 0x00007f6fd3b38991 in qb_logt_worker_thread () from /lib64/libqb.so.0
>> #4 0x00007f6fd349ae25 in start_thread () from /lib64/libpthread.so.0
>> #5 0x00007f6fd31c4bad in clone () from /lib64/libc.so.6
>>
>> Thread 1 (Thread 0x7f6fd43c7b80 (LWP 16242)):
>> #0 0x00007f6fd31c5183 in epoll_wait () from /lib64/libc.so.6
>> #1 0x00007f6fd3b3dea8 in poll_and_add_to_jobs () from /lib64/libqb.so.0
>> #2 0x00007f6fd3b2ed93 in qb_loop_run () from /lib64/libqb.so.0
>> #3 0x000055592d62ff78 in main ()
>>
>>
>> [1]
>> top - 15:51:38 up 47 min,  2 users,  load average: 3.81, 1.70, 0.70
>> Tasks: 208 total,   4 running, 130 sleeping,   0 stopped,   0 zombie
>> %Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
>> 0.0 st
>> %Cpu1  : 53.8 us, 46.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,
>> 0.0 st
>> %Cpu2  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
>> 0.0 st
>> %Cpu3  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
>> 0.0 st
>> %Cpu4  :  3.8 us,  0.0 sy,  0.0 ni, 96.2 id,  0.0 wa,  0.0 hi,  0.0 si,
>> 0.0 st
>> %Cpu5  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
>> 0.0 st
>> %Cpu6  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
>> 0.0 st
>> %Cpu7  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
>> 0.0 st
>> KiB Mem :  4210184 total,  3374752 free,   387380 used,   448052 buff/cache
>> KiB Swap:  1048572 total,  1048572 free,        0 used.  3660276 avail Mem
>>
>>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>> 16217 root      rt   0  194832  92532  66128 R 104.0  2.2   2:54.00 corosync
>> 24004 root      20   0  161984   4432   3696 R   4.0  0.1   0:00.02 top
>>
>>    cat /proc/interrupts |grep eth
>>    88:          2          0          0          0          0          0
>>          0          0  xen-pirq    -msi-x     eth0
>>    89:       2805          0          0     419621          0          0
>>          0          0  xen-pirq    -msi-x     eth0-TxRx-0
>>    90:          2          0          0          0          0          0
>>          1          0  xen-pirq    -msi-x     eth1
>>    91:        435     171086          0          0          0          0
>>          0          0  xen-pirq    -msi-x     eth1-TxRx-0
>>
>> [2]
>> watch cat /proc/softirqs
>>                       CPU0       CPU1       CPU2       CPU3       CPU4
>>     CPU5       CPU6       CPU7
>>             HI:          1          0          0          0          0
>>        0          0          0
>>          TIMER:     355339     786951     367148     266583     389591
>> 357184     369577     374880
>>         NET_TX:          8       1081          1        743          1
>>        1        223          5
>>         NET_RX:       5807     230818       2709     529755       3323
>>     2598       2550       1878
>>          BLOCK:       8166         53         42         13         10
>>     6563         21       6894
>>       IRQ_POLL:          0          0          0          0          0
>>        0          0          0
>>        TASKLET:        103          6          9         85          0
>>        0          0          0
>>          SCHED:     240428     100842      83166      51213      77824
>>    77270      72074      72261
>>        HRTIMER:          0          0          0          0          0
>>        0          0          0
>>            RCU:     135491     245345     139513     130591     140280
>> 135094     141007     136063
>>
>>
>> [3]
>> /proc/sys/kernel/sched_rt_runtime_us
>> 950000
>> /proc/sys/kernel/sched_rt_period_us
>> 1000000
>>
>> Best regards,
>> --Edwin
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>