[ClusterLabs] Antw: corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

Fri Feb 15 00:23:27 UTC 2019

Hi!

IMHO any process running at real-time priorities must make sure that it
consumes the CPU only for shorrt moment that are really critical to be
performed in time. Specifically having some code that performs poorly (for
various reasons) is absolutely _not_ a candidate to be run with real-time
priorities to fix the bad performance!
So if corosync is using 100% CPU in real-time, this says something about the
code quality in corosync IMHO.

Also SCHED_RR is even more cooperative than SCHED_FIFO, and another interesting
topic is which of the 100 real-time priorities to assign to which process.
(I've written some C code that allows to select the scheduling mechanism and
the priority via command-line argument, so the user and not the program is
responsible if the system locks up. Maybe corosync should thing about something
similar.

Personally I also think that a program that sends megabytes of XML as
realtime-priority task through the network is broken by design: If you care
about response time, minimize the data and processing required before using
real-time priorities.

As a final remark: Recently due to some software bug we had a server with load
800 or slightly more that had no problem with performance, and the problem was
noticed only because we had some load monitoring and alerting configured. On a
badly configured system we had SDB timeouts over the weekend at much lower
load. Debugging showed there was an I/O bottleneck causing disk response times
to shoot up (which in turn made me write that real-time monitoring program
cited above). 

Sorry if this thought do not answer your question and do not fix your problem.
I suspect that things get significantly more serious if some processes can only
run on specific CPUs, and such a CPU is performing a RT task (otherwis a
different CPU will take care of the non-RT task). At least that's the last
Linux task scheduler I know...

Regards,
Ulrich

>>> Edwin Török <edvin.torok at citrix.com> 14.02.19 18.34 Uhr >>>
Hello,

We were testing corosync 2.4.3/libqb 1.0.1-6/sbd 1.3.1/gfs2 on 4.19 and
noticed a fundamental problem with realtime priorities:
- corosync runs on CPU3, and interrupts for the NIC used by corosync are
also routed to CPU3
- corosync runs with SCHED_RR, ksoftirqd does not (should it?), but
without it packets sent/received from that interface would not get processed
- corosync is in a busy loop using 100% CPU, never giving a chance for
softirqs to be processed (including TIMER and SCHED)

This appears to be a priority inversion problem, if corosync runs as
realtime then everything it needs (timers...) should be realtime as
well, otherwise running as realtime guarantees we'll miss the watchdog
deadline, instead of guaranteeing that we process the data before the
deadline.

Do you have some advice on what the expected realtime priorities would
be for:
- corosync
- sbd
- hard irqs
- soft irqs

Also would it be possible for corosync to avoid hogging the CPU in libqb?
(Our hypothesis is that if softirqs are not processed then timers
wouldn't work for processes on that CPU either)

Some more background and analysis:

We noticed that cluster membership changes were very unstable
(especially as you approach 16 physical host clusters) causing the whole
cluster to fence when adding a single node. This was previously working
fairly reliably with a 4.4 based kernel.

I've increased SBD timeout to 3600 to be able to investigate the problem
and noticed that corosync was using 100% of CPU3 [1] and I immediately
lost SSH access on eth0 (corosync was using eth1), where eth0's
interrupts were also processed on CPU3 (and irqbalance didn't move it).

IIUC SCHED_RR tasks should not be able to take up 100% of CPU, according
to [3] it shouldn't be allowed to use more than 95% of CPU.

Softirqs were not processed at all on CPU3 (see [2], the numbers in the
CPU3 column did not change, the numbers in the other columns did).
Tried decreasing priority of corosync using chrt to 1, which didn't
help. I immediately solved the CPU usage problem on corosync.

I tried a simple infinite loop program with realtime priority, but it
didn't reproduce the problems with interrupts getting stuck.

Three problems here:
* all softirqs were stuck (not being processed) on CPU3, which included
TIMER and SCHED. corosync relies quite heavily on timers, would lack of
processing them cause the 100% CPU usage?
* is there a kernel bug introduced between 4.4 - 4.19 that causes
realtime tasks to not respect the 95% limit anymore? This would leave 5%
time for IRQs, including NIC IRQs
*  if corosync runs at higher priority than the kernel softirq thread
processing NIC IRQ how is corosync expecting incoming packets to be
processed, if it is hogging the CPU by receiving, polling and sending
packets?

On another host which exhibited the same problem I've run strace (which
also had the side-effect of getting corosync unstuck from 100% CPU use
after strace finished):
1 bind
5 close
688 epoll_wait
8 futex
1 getsockname
3 ioctl
1 open
3 recvfrom
190 recvmsg
87 sendmsg
9 sendto
4 socket
6 write

On yet another host I've run gdb while corosync was stuck:
Thread 2 (Thread 0x7f6fd0c9b700 (LWP 16245)):
#0 0x00007f6fd34a0afb in do_futex_wait.constprop.1 ()
from /lib64/libpthread.so.0
#1 0x00007f6fd34a0b8f in __new_sem_wait_slow.constprop.0 ()
from /lib64/libpthread.so.0
#2 0x00007f6fd34a0c2b in sem_wait@@GLIBC_2.2.5 () from
/lib64/libpthread.so.0
#3 0x00007f6fd3b38991 in qb_logt_worker_thread () from /lib64/libqb.so.0
#4 0x00007f6fd349ae25 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f6fd31c4bad in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f6fd43c7b80 (LWP 16242)):
#0 0x00007f6fd31c5183 in epoll_wait () from /lib64/libc.so.6
#1 0x00007f6fd3b3dea8 in poll_and_add_to_jobs () from /lib64/libqb.so.0
#2 0x00007f6fd3b2ed93 in qb_loop_run () from /lib64/libqb.so.0
#3 0x000055592d62ff78 in main ()

[1]
top - 15:51:38 up 47 min,  2 users,  load average: 3.81, 1.70, 0.70
Tasks: 208 total,   4 running, 130 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
0.0 st
%Cpu1  : 53.8 us, 46.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,
0.0 st
%Cpu2  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
0.0 st
%Cpu3  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
0.0 st
%Cpu4  :  3.8 us,  0.0 sy,  0.0 ni, 96.2 id,  0.0 wa,  0.0 hi,  0.0 si,
0.0 st
%Cpu5  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
0.0 st
%Cpu6  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
0.0 st
%Cpu7  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
0.0 st
KiB Mem :  4210184 total,  3374752 free,   387380 used,   448052 buff/cache
KiB Swap:  1048572 total,  1048572 free,        0 used.  3660276 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
16217 root      rt   0  194832  92532  66128 R 104.0  2.2   2:54.00 corosync
24004 root      20   0  161984   4432   3696 R   4.0  0.1   0:00.02 top

 cat /proc/interrupts |grep eth
 88:          2          0          0          0          0          0
       0          0  xen-pirq    -msi-x     eth0
 89:       2805          0          0     419621          0          0
       0          0  xen-pirq    -msi-x     eth0-TxRx-0
 90:          2          0          0          0          0          0
       1          0  xen-pirq    -msi-x     eth1
 91:        435     171086          0          0          0          0
       0          0  xen-pirq    -msi-x     eth1-TxRx-0

[2]
watch cat /proc/softirqs
                    CPU0       CPU1       CPU2       CPU3       CPU4
  CPU5       CPU6       CPU7
          HI:          1          0          0          0          0
     0          0          0
       TIMER:     355339     786951     367148     266583     389591
357184     369577     374880
      NET_TX:          8       1081          1        743          1
     1          2598       2550       1878
       BLOCK:       8166         53         42         13         10
  6563         21       6894
    IRQ_POLL:          0          0          0          0          0
     0          0          0
     TASKLET:        103          6          9         85          0
     0          0          0
       SCHED:     240428     100842      83166      51213      77824
 77270      72074      72261
     HRTIMER:          0          0          0          0          0
     0          0          0
         RCU:     135491     245345     139513     130591     140280
135094     141007     136063

[3]
/proc/sys/kernel/sched_rt_runtime_us
950000
/proc/sys/kernel/sched_rt_period_us
1000000

Best regards,
--Edwin
_______________________________________________
Users mailing list: Users at clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org