[ClusterLabs] Antw: corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?
Jan Friesse
jfriesse at redhat.com
Fri Feb 15 07:48:39 UTC 2019
Ulrich Windl wrote:
> Hi!
>
> IMHO any process running at real-time priorities must make sure that it
> consumes the CPU only for the short moments that are really critical to be
> performed in time. Specifically, code that performs poorly (for
> various reasons) is absolutely _not_ a candidate to be run with real-time
> priorities to fix the bad performance!
> So if corosync is using 100% CPU in real-time, this says something about the
> code quality in corosync, IMHO.
>
> Also, SCHED_RR is even more cooperative than SCHED_FIFO, and another interesting
> topic is which of the 100 real-time priorities to assign to which process.
> (I've written some C code that allows selecting the scheduling mechanism and
> the priority via a command-line argument, so the user, not the program, is
> responsible if the system locks up. Maybe corosync should think about something
> similar.)
And this is exactly why the corosync options -p (-P) exist (in 3.x these
were moved into corosync.conf as sched_rr/priority).
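
For illustration, a minimal sketch of letting the user pick the policy and
priority on the command line; the wrapper name and argument layout here are
invented, and this is not corosync's actual option handling:

/* Minimal sketch (not corosync code): let the caller choose the
 * scheduling policy and priority on the command line, e.g.
 *   ./rt-wrapper rr 2      or      ./rt-wrapper fifo 10
 * so the user, not the program, decides how aggressive to be. */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s rr|fifo|other <priority>\n", argv[0]);
        return 1;
    }

    int policy = SCHED_OTHER;
    if (strcmp(argv[1], "rr") == 0)
        policy = SCHED_RR;
    else if (strcmp(argv[1], "fifo") == 0)
        policy = SCHED_FIFO;

    struct sched_param sp;
    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = atoi(argv[2]);   /* must be 0 for SCHED_OTHER */

    if (sched_setscheduler(0, policy, &sp) != 0) {
        perror("sched_setscheduler");
        return 1;
    }

    /* ... the real work would start here ... */
    return 0;
}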
>
> Personally, I also think that a program that sends megabytes of XML over the
> network as a realtime-priority task is broken by design: if you care
> about response time, minimize the data and processing required before using
> real-time priorities.
>
> As a final remark: recently, due to some software bug, we had a server with a
> load of 800 or slightly more that had no performance problem, and the issue was
> noticed only because we had some load monitoring and alerting configured. On a
> badly configured system we had SBD timeouts over the weekend at much lower
> load. Debugging showed there was an I/O bottleneck causing disk response times
> to shoot up (which in turn made me write the real-time monitoring program
> mentioned above).
>
>
> Sorry if these thoughts do not answer your question and do not fix your problem.
> I suspect that things get significantly more serious if some processes can only
> run on specific CPUs and such a CPU is busy with an RT task (otherwise a
> different CPU will take care of the non-RT task). At least that's how the last
> Linux task scheduler I looked at behaved...
>
> Regards,
> Ulrich
>
>
>>>> Edwin Török <edvin.torok at citrix.com> 14.02.19 18.34 Uhr >>>
> Hello,
>
> We were testing corosync 2.4.3/libqb 1.0.1-6/sbd 1.3.1/gfs2 on 4.19 and
> noticed a fundamental problem with realtime priorities:
> - corosync runs on CPU3, and interrupts for the NIC used by corosync are
> also routed to CPU3
> - corosync runs with SCHED_RR, ksoftirqd does not (should it?), but
> without it packets sent/received from that interface would not get processed
> - corosync is in a busy loop using 100% CPU, never giving a chance for
> softirqs to be processed (including TIMER and SCHED)
>
> This appears to be a priority inversion problem: if corosync runs as
> realtime, then everything it needs (timers, ...) should be realtime as
> well; otherwise running as realtime guarantees that we miss the watchdog
> deadline, instead of guaranteeing that we process the data before the
> deadline.
>
> Do you have some advice on what the expected realtime priorities would
> be for:
> - corosync
> - sbd
> - hard irqs
> - soft irqs
>
> Also would it be possible for corosync to avoid hogging the CPU in libqb?
> (Our hypothesis is that if softirqs are not processed then timers
> wouldn't work for processes on that CPU either)
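
To make the question concrete, here is a minimal sketch of a poll loop that
blocks with a finite timeout instead of spinning; it is purely illustrative
and not libqb's actual implementation, and the 10 ms timeout is an arbitrary
assumption:

/* Sketch only: a poll loop that actually blocks when idle, so the CPU is
 * released between events instead of being hogged by a busy loop. */
#include <sys/epoll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int epfd = epoll_create1(0);
    if (epfd < 0) {
        perror("epoll_create1");
        return 1;
    }

    /* Watch stdin just so the loop has something to wait on. */
    struct epoll_event want = { .events = EPOLLIN, .data.fd = 0 };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, 0, &want) != 0) {
        perror("epoll_ctl");
        return 1;
    }

    struct epoll_event ev[16];
    for (;;) {
        /* A finite timeout (here 10 ms) means the task blocks when nothing
         * is ready; a timeout of 0 would turn this into exactly the kind of
         * busy loop described above. */
        int n = epoll_wait(epfd, ev, 16, 10);
        if (n < 0) {
            perror("epoll_wait");
            return 1;
        }
        for (int i = 0; i < n; i++) {
            char buf[256];
            (void)read(ev[i].data.fd, buf, sizeof(buf)); /* drain and ignore */
        }
    }
}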
>
> Some more background and analysis:
>
> We noticed that cluster membership changes were very unstable
> (especially as you approach clusters of 16 physical hosts), causing the whole
> cluster to fence when adding a single node. This previously worked
> fairly reliably with a 4.4-based kernel.
>
> I've increased the SBD timeout to 3600 to be able to investigate the problem
> and noticed that corosync was using 100% of CPU3 [1]. I immediately
> lost SSH access on eth0 (corosync was using eth1), as eth0's
> interrupts were also processed on CPU3 (and irqbalance didn't move them).
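
One manual workaround, had it been attempted, would be to move eth0's queue
interrupt (IRQ 89 in the /proc/interrupts output further below) off CPU3 by
hand, which is roughly what irqbalance was expected to do. A minimal sketch,
assuming the standard /proc/irq/<n>/smp_affinity interface; whether Xen's
xen-pirq routing honours it is an open assumption:

/* Sketch: write a new CPU affinity mask for IRQ 89 (eth0-TxRx-0). */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/irq/89/smp_affinity", "w");
    if (f == NULL) {
        perror("fopen /proc/irq/89/smp_affinity");
        return 1;
    }
    /* 0xf7 = CPUs 0-7 except CPU3, keeping NIC interrupt work away from
     * the CPU corosync is spinning on. */
    fprintf(f, "f7\n");
    fclose(f);
    return 0;
}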
>
> IIUC, SCHED_RR tasks should not be able to take up 100% of a CPU; according
> to [3] they shouldn't be allowed to use more than 95% of it.
>
> Softirqs were not processed at all on CPU3 (see [2]: the numbers in the
> CPU3 column did not change, while the numbers in the other columns did).
> Tried decreasing the realtime priority of corosync to 1 using chrt, which
> didn't help. Taking corosync out of the realtime class with chrt immediately
> solved the CPU usage problem on corosync.
>
> I tried a simple infinite loop program with realtime priority, but it
> didn't reproduce the problems with interrupts getting stuck.
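
For reference, such a reproducer might look roughly like the sketch below; the
author's actual test program is not shown in the thread, and the choice of
CPU 3 and priority 2 here are assumptions:

/* Burn CPU in user space under SCHED_RR on the same CPU as corosync and
 * watch whether softirq counters for that CPU stall. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* Pin to CPU3 so the test hits the CPU corosync runs on. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    struct sched_param sp = { .sched_priority = 2 };
    if (sched_setscheduler(0, SCHED_RR, &sp) != 0) {
        perror("sched_setscheduler");
        return 1;
    }

    for (;;)
        ;   /* pure user-space spin, no syscalls */

    return 0;
}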
>
>
> Three problems here:
> * all softirqs were stuck (not being processed) on CPU3, which included
> TIMER and SCHED. corosync relies quite heavily on timers; would the lack of
> processing them cause the 100% CPU usage?
> * is there a kernel bug introduced between 4.4 and 4.19 that causes
> realtime tasks to no longer respect the 95% limit? That limit would leave 5%
> of the time for IRQs, including NIC IRQs
> * if corosync runs at a higher priority than the kernel softirq thread
> processing the NIC IRQs, how does corosync expect incoming packets to be
> processed while it is hogging the CPU by receiving, polling and sending
> packets?
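
One way to check that ranking directly is to read back the scheduler policy
and priority of the tasks involved (corosync versus ksoftirqd/3, which by
default runs SCHED_OTHER). A minimal diagnostic sketch; the default pid of
16217 is just the corosync pid from the top output in [1]:

/* Print the scheduling policy and realtime priority of a given pid. */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    pid_t pid = (argc > 1) ? (pid_t)atoi(argv[1]) : 16217;

    int policy = sched_getscheduler(pid);
    struct sched_param sp;
    if (policy < 0 || sched_getparam(pid, &sp) != 0) {
        perror("sched_getscheduler/sched_getparam");
        return 1;
    }

    printf("pid %d: policy %s, rt priority %d\n", (int)pid,
           policy == SCHED_RR   ? "SCHED_RR" :
           policy == SCHED_FIFO ? "SCHED_FIFO" : "SCHED_OTHER",
           sp.sched_priority);
    return 0;
}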
>
> On another host which exhibited the same problem I ran strace (which
> also had the side effect of getting corosync unstuck from 100% CPU use
> after strace finished); the syscall counts were:
> 1 bind
> 5 close
> 688 epoll_wait
> 8 futex
> 1 getsockname
> 3 ioctl
> 1 open
> 3 recvfrom
> 190 recvmsg
> 87 sendmsg
> 9 sendto
> 4 socket
> 6 write
>
> On yet another host I've run gdb while corosync was stuck:
> Thread 2 (Thread 0x7f6fd0c9b700 (LWP 16245)):
> #0 0x00007f6fd34a0afb in do_futex_wait.constprop.1 ()
> from /lib64/libpthread.so.0
> #1 0x00007f6fd34a0b8f in __new_sem_wait_slow.constprop.0 ()
> from /lib64/libpthread.so.0
> #2 0x00007f6fd34a0c2b in sem_wait@@GLIBC_2.2.5 () from
> /lib64/libpthread.so.0
> #3 0x00007f6fd3b38991 in qb_logt_worker_thread () from /lib64/libqb.so.0
> #4 0x00007f6fd349ae25 in start_thread () from /lib64/libpthread.so.0
> #5 0x00007f6fd31c4bad in clone () from /lib64/libc.so.6
>
> Thread 1 (Thread 0x7f6fd43c7b80 (LWP 16242)):
> #0 0x00007f6fd31c5183 in epoll_wait () from /lib64/libc.so.6
> #1 0x00007f6fd3b3dea8 in poll_and_add_to_jobs () from /lib64/libqb.so.0
> #2 0x00007f6fd3b2ed93 in qb_loop_run () from /lib64/libqb.so.0
> #3 0x000055592d62ff78 in main ()
>
>
> [1]
> top - 15:51:38 up 47 min, 2 users, load average: 3.81, 1.70, 0.70
> Tasks: 208 total, 4 running, 130 sleeping, 0 stopped, 0 zombie
> %Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu1  : 53.8 us, 46.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu2  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu3  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu4  :  3.8 us,  0.0 sy,  0.0 ni, 96.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu5  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu6  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu7  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> KiB Mem : 4210184 total, 3374752 free, 387380 used, 448052 buff/cache
> KiB Swap: 1048572 total, 1048572 free, 0 used. 3660276 avail Mem
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 16217 root rt 0 194832 92532 66128 R 104.0 2.2 2:54.00 corosync
> 24004 root 20 0 161984 4432 3696 R 4.0 0.1 0:00.02 top
>
> cat /proc/interrupts |grep eth
>  88:       2        0       0        0       0       0       0       0  xen-pirq  -msi-x  eth0
>  89:    2805        0       0   419621       0       0       0       0  xen-pirq  -msi-x  eth0-TxRx-0
>  90:       2        0       0        0       0       0       1       0  xen-pirq  -msi-x  eth1
>  91:     435   171086       0        0       0       0       0       0  xen-pirq  -msi-x  eth1-TxRx-0
>
> [2]
> watch cat /proc/softirqs
>                   CPU0      CPU1      CPU2      CPU3      CPU4      CPU5      CPU6      CPU7
>       HI:            1         0         0         0         0         0         0         0
>    TIMER:       355339    786951    367148    266583    389591    357184    369577    374880
>   NET_TX:            8      1081         1       743         1         1      2598      2550      1878
>    BLOCK:         8166        53        42        13        10      6563        21      6894
> IRQ_POLL:            0         0         0         0         0         0         0         0
>  TASKLET:          103         6         9        85         0         0         0         0
>    SCHED:       240428    100842     83166     51213     77824     77270     72074     72261
>  HRTIMER:            0         0         0         0         0         0         0         0
>      RCU:       135491    245345    139513    130591    140280    135094    141007    136063
>
>
> [3]
> /proc/sys/kernel/sched_rt_runtime_us
> 950000
> /proc/sys/kernel/sched_rt_period_us
> 1000000
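
These defaults mean realtime tasks should be throttled after
950000 / 1000000 = 95% of every second, leaving 5% for everything else. A
minimal sketch that reads the two knobs shown above and prints the resulting
limit:

#include <stdio.h>

/* Read a single decimal value from a procfs file; returns 0 on success. */
static int read_long(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int ok = (fscanf(f, "%ld", out) == 1) ? 0 : -1;
    fclose(f);
    return ok;
}

int main(void)
{
    long runtime, period;

    if (read_long("/proc/sys/kernel/sched_rt_runtime_us", &runtime) != 0 ||
        read_long("/proc/sys/kernel/sched_rt_period_us", &period) != 0 ||
        period <= 0) {
        fprintf(stderr, "could not read RT throttling settings\n");
        return 1;
    }

    if (runtime < 0)
        /* -1 means RT throttling is disabled entirely. */
        printf("RT throttling disabled\n");
    else
        printf("RT tasks may use up to %.1f%% of each %ld us period\n",
               100.0 * (double)runtime / (double)period, period);
    return 0;
}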
>
> Best regards,
> --Edwin
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>