[ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

Edvin Torok edvin.torok at citrix.com
Thu Feb 14 13:54:27 EST 2019


Apologies for top posting; the strace output you asked for is available here (although running strace itself had the side effect of getting corosync out of the livelock):
https://clbin.com/9kOUM

Best regards,
--Edwin
________________________________________
From: Jan Friesse <jfriesse at redhat.com>
Sent: 14 February 2019 18:34
To: Cluster Labs - All topics related to open-source clustering welcomed; Edvin Torok
Cc: Mark Syms
Subject: Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

Edwin,

> Hello,
>
> We were testing corosync 2.4.3/libqb 1.0.1-6/sbd 1.3.1/gfs2 on 4.19 and
> noticed a fundamental problem with realtime priorities:
> - corosync runs on CPU3, and interrupts for the NIC used by corosync are
> also routed to CPU3
> - corosync runs with SCHED_RR, ksoftirqd does not (should it?), but without
> ksoftirqd the packets sent/received on that interface would not get processed
> - corosync is in a busy loop using 100% CPU, never giving a chance for
> softirqs to be processed (including TIMER and SCHED)
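>
> For reference, this is roughly how we checked the placement described above
> (the IRQ number is specific to our box, taken from [1] below):
>
>   # which CPU services the corosync NIC's interrupts (IRQ 91 = eth1-TxRx-0 in [1])
>   grep eth1 /proc/interrupts
>   cat /proc/irq/91/smp_affinity_list
>   # which CPUs corosync may run on and what scheduling class/priority it has
>   taskset -cp $(pidof corosync)
>   chrt -p $(pidof corosync)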
>
> This appears to be a priority inversion problem: if corosync runs as
> realtime, then everything it needs (timers, ...) should be realtime as
> well; otherwise running as realtime guarantees we'll miss the watchdog
> deadline instead of guaranteeing that we process the data before the
> deadline.
>
> Do you have some advice on what the expected realtime priorities would
> be for:
> - corosync
> - sbd
> - hard irqs
> - soft irqs
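>
> For context, this is how we have been inspecting what each of these currently
> runs with (hard IRQs only show up as irq/* kernel threads when the handlers
> are threaded):
>
>   # scheduling class (cls) and realtime priority (rtprio) of the relevant tasks
>   ps -eo pid,comm,cls,rtprio | egrep 'corosync|sbd|ksoftirqd|irq/'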
>
> Also, would it be possible for corosync to avoid hogging the CPU in libqb?
> (Our hypothesis is that if softirqs are not processed, then timers won't
> work for processes on that CPU either.)
>
> Some more background and analysis:
>
> We noticed that cluster membership changes were very unstable
> (especially as you approach 16-physical-host clusters), causing the whole
> cluster to fence when adding a single node. This previously worked
> fairly reliably with a 4.4-based kernel.
>
> I increased the SBD timeout to 3600 s to be able to investigate the problem
> and noticed that corosync was using 100% of CPU3 [1]. I also immediately
> lost SSH access on eth0 (corosync was using eth1), because eth0's
> interrupts were also processed on CPU3 (and irqbalance didn't move them).
>
> IIUC, SCHED_RR tasks should not be able to take up 100% of a CPU; according
> to [3] they shouldn't be allowed to use more than 95% of it
> (sched_rt_runtime_us / sched_rt_period_us = 950000 / 1000000 = 95%).
>
> Softirqs were not processed at all on CPU3 (see [2]: the numbers in the
> CPU3 column did not change, while the numbers in the other columns did).
> I tried decreasing the priority of corosync to 1 using chrt, which didn't
> help. I then increased the priority of ksoftirqd to 50 using chrt, which
> immediately fixed corosync's CPU usage.
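>
> Concretely, the chrt experiments were along these lines (a hypothetical
> reconstruction; it assumes the CPU3 ksoftirqd thread was the one adjusted):
>
>   # lower corosync to SCHED_RR priority 1 (didn't help)
>   chrt -r -p 1 $(pidof corosync)
>   # raise CPU3's ksoftirqd to priority 50 (this is what got corosync unstuck;
>   # exact policy not recorded, SCHED_RR shown here)
>   chrt -r -p 50 $(pgrep ksoftirqd/3)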
>
> I tried a simple infinite loop program with realtime priority, but it
> didn't reproduce the problems with interrupts getting stuck.
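>
> (For what it's worth, the attempt was essentially the equivalent of the
> following one-liner; the priority and CPU number are illustrative:)
>
>   # busy-loop at SCHED_RR priority 99, pinned to the same CPU as the NIC interrupts
>   taskset -c 3 chrt -r 99 sh -c 'while :; do :; done'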
>
>
> Three problems here:
> * All softirqs were stuck (not being processed) on CPU3, including
> TIMER and SCHED. Corosync relies quite heavily on timers; would the lack
> of timer processing cause the 100% CPU usage?
> * Is there a kernel bug introduced between 4.4 and 4.19 that causes
> realtime tasks to no longer respect the 95% limit? That limit would leave
> 5% of CPU time for IRQs, including NIC IRQs.
> * If corosync runs at a higher priority than the kernel softirq thread
> processing the NIC IRQs, how does it expect incoming packets to be
> processed while it is hogging the CPU by receiving, polling and sending
> packets?
>
> On another host which exhibited the same problem I ran strace (which
> also had the side effect of getting corosync unstuck from 100% CPU use
> after strace finished); syscall counts:
> 1 bind
> 5 close
> 688 epoll_wait
> 8 futex
> 1 getsockname
> 3 ioctl
> 1 open
> 3 recvfrom
> 190 recvmsg
> 87 sendmsg
> 9 sendto
> 4 socket
> 6 write
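>
> (Counts of this kind can be gathered by attaching strace with -c for a few
> seconds, e.g. the sketch below; as noted above, attaching strace itself
> tends to unstick corosync:)
>
>   # attach for a few seconds, then ^C to get the per-syscall summary
>   strace -c -f -p $(pidof corosync)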
>

Don't you have the full strace output by any chance? Corosync itself
shouldn't really get stuck for that long, especially if softirqs were stuck
so it couldn't get any packets, right? So it may be (take it as a shot in
the dark) that some error code is not handled and instead there is some kind
of never-ending loop.
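
If it happens again, one way to check that theory (just a suggestion) is to
watch only the return values of the hot syscalls for a few seconds and see
whether one of them keeps failing with the same errno in a tight loop, e.g.:

  # watch the return values of the hot syscalls on the running daemon
  strace -f -tt -e trace=epoll_wait,recvmsg,sendmsg -p $(pidof corosync)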

Honza

> On yet another host I've run gdb while corosync was stuck:
> Thread 2 (Thread 0x7f6fd0c9b700 (LWP 16245)):
> #0 0x00007f6fd34a0afb in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
> #1 0x00007f6fd34a0b8f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
> #2 0x00007f6fd34a0c2b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
> #3 0x00007f6fd3b38991 in qb_logt_worker_thread () from /lib64/libqb.so.0
> #4 0x00007f6fd349ae25 in start_thread () from /lib64/libpthread.so.0
> #5 0x00007f6fd31c4bad in clone () from /lib64/libc.so.6
>
> Thread 1 (Thread 0x7f6fd43c7b80 (LWP 16242)):
> #0 0x00007f6fd31c5183 in epoll_wait () from /lib64/libc.so.6
> #1 0x00007f6fd3b3dea8 in poll_and_add_to_jobs () from /lib64/libqb.so.0
> #2 0x00007f6fd3b2ed93 in qb_loop_run () from /lib64/libqb.so.0
> #3 0x000055592d62ff78 in main ()
>
>
> [1]
> top - 15:51:38 up 47 min,  2 users,  load average: 3.81, 1.70, 0.70
> Tasks: 208 total,   4 running, 130 sleeping,   0 stopped,   0 zombie
> %Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu1  : 53.8 us, 46.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu2  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu3  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu4  :  3.8 us,  0.0 sy,  0.0 ni, 96.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu5  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu6  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu7  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> KiB Mem :  4210184 total,  3374752 free,   387380 used,   448052 buff/cache
> KiB Swap:  1048572 total,  1048572 free,        0 used.  3660276 avail Mem
>
>    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> 16217 root      rt   0  194832  92532  66128 R 104.0  2.2   2:54.00 corosync
> 24004 root      20   0  161984   4432   3696 R   4.0  0.1   0:00.02 top
>
>   cat /proc/interrupts | grep eth
>   88:          2          0          0          0          0          0          0          0  xen-pirq-msi-x  eth0
>   89:       2805          0          0     419621          0          0          0          0  xen-pirq-msi-x  eth0-TxRx-0
>   90:          2          0          0          0          0          0          1          0  xen-pirq-msi-x  eth1
>   91:        435     171086          0          0          0          0          0          0  xen-pirq-msi-x  eth1-TxRx-0
>
> [2]
> watch cat /proc/softirqs
>                   CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>         HI:          1          0          0          0          0          0          0          0
>      TIMER:     355339     786951     367148     266583     389591     357184     369577     374880
>     NET_TX:          8       1081          1        743          1          1        223          5
>     NET_RX:       5807     230818       2709     529755       3323       2598       2550       1878
>      BLOCK:       8166         53         42         13         10       6563         21       6894
>   IRQ_POLL:          0          0          0          0          0          0          0          0
>    TASKLET:        103          6          9         85          0          0          0          0
>      SCHED:     240428     100842      83166      51213      77824      77270      72074      72261
>    HRTIMER:          0          0          0          0          0          0          0          0
>        RCU:     135491     245345     139513     130591     140280     135094     141007     136063
>
>
> [3]
> /proc/sys/kernel/sched_rt_runtime_us
> 950000
> /proc/sys/kernel/sched_rt_period_us
> 1000000
>
> Best regards,
> --Edwin
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



