[ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

Fri Feb 15 06:12:04 EST 2019

On 15/02/2019 10:56, Edwin Török wrote:
> On 15/02/2019 09:31, Christine Caulfield wrote:
>> On 14/02/2019 17:33, Edwin Török wrote:
>>> Hello,
>>>
>>> We were testing corosync 2.4.3/libqb 1.0.1-6/sbd 1.3.1/gfs2 on 4.19 and
>>> noticed a fundamental problem with realtime priorities:
>>> - corosync runs on CPU3, and interrupts for the NIC used by corosync are
>>> also routed to CPU3
>>> - corosync runs with SCHED_RR, ksoftirqd does not (should it?), but
>>> without it packets sent/received from that interface would not get processed
>>> - corosync is in a busy loop using 100% CPU, never giving a chance for
>>> softirqs to be processed (including TIMER and SCHED)
>>>
>>
>>
>> Can you tell me what distribution this is please? 
> This is a not-yet-released development version of XenServer based on
> CentOS 7.5/7.6.
> The kernel is 4.19.19 + patches to make it work well with Xen
> (previously we were using a 4.4.52 + Xen patches and backports kernel)
> 
> The versions of packages are:
> rpm -q libqb corosync dlm sbd kernel
> libqb-1.0.1-6.el7.x86_64
> corosync-2.4.3-13.xs+2.0.0.x86_64
> dlm-4.0.7-1.el7.x86_64
> sbd-1.3.1-7.xs+2.0.0.x86_64
> kernel-4.19.19-5.0.0.x86_64
> 
> Package versions with +xs in version have xenserver specific patches
> applied, libqb is coming straight from upstream CentOS here:
> https://git.centos.org/tree/rpms!libqb.git/fe522aa5e0af26c0cff1170b6d766b5f248778d2
> 
>> There are patches to
>> libqb that should be applied to fix a similar problem in 1.0.1-6 - but
>> that's a RHEL version and kernel 4.19 is not a RHEL 7 kernel, so I just
>> need to be sure that those fixes are in your libqb before going any
> further.
> 
> We have libqb 1.0.1-6 from CentOS, it looks like there is 1.0.1-7 which
> includes an SHM crash fix, is this the one you were refering to, or is
> there an additional patch elsewhere?
> https://git.centos.org/commit/rpms!libqb.git/b5ede72cb0faf5b70ddd504822552fe97bfbbb5e
> 

Thanks. libqb-1.0.1-6 does have the patch I was thinking of - I mainly
wanted to check it wasn't someone else's package that didn't have that
patch in. The SHM patch in -7 fixes a race at shutdown (often seen with
sbd). That shouldn't be a problem because there is a workaround in -6
anyway, and it's not fixing a spin, which is what we have here of course.

Are there any messages in the system logs from either corosync or
related subsystems?

Chrissie

>> Without doubt this is a bug, in normal operation corosync is quite light
>> on CPU.
> 
> Thanks for the help in advance,
> --Edwin
> 
>>
>> Chrissie
>>
>>> This appears to be a priority inversion problem, if corosync runs as
>>> realtime then everything it needs (timers...) should be realtime as
>>> well, otherwise running as realtime guarantees we'll miss the watchdog
>>> deadline, instead of guaranteeing that we process the data before the
>>> deadline.
>>>
>>> Do you have some advice on what the expected realtime priorities would
>>> be for:
>>> - corosync
>>> - sbd
>>> - hard irqs
>>> - soft irqs
>>>
>>> Also would it be possible for corosync to avoid hogging the CPU in libqb?
>>> (Our hypothesis is that if softirqs are not processed then timers
>>> wouldn't work for processes on that CPU either)
>>>
>>> Some more background and analysis:
>>>
>>> We noticed that cluster membership changes were very unstable
>>> (especially as you approach 16 physical host clusters) causing the whole
>>> cluster to fence when adding a single node. This was previously working
>>> fairly reliably with a 4.4 based kernel.
>>>
>>> I've increased SBD timeout to 3600 to be able to investigate the problem
>>> and noticed that corosync was using 100% of CPU3 [1] and I immediately
>>> lost SSH access on eth0 (corosync was using eth1), where eth0's
>>> interrupts were also processed on CPU3 (and irqbalance didn't move it).
>>>
>>> IIUC SCHED_RR tasks should not be able to take up 100% of CPU, according
>>> to [3] it shouldn't be allowed to use more than 95% of CPU.
>>>
>>> Softirqs were not processed at all on CPU3 (see [2], the numbers in the
>>> CPU3 column did not change, the numbers in the other columns did).
>>> Tried decreasing priority of corosync using chrt to 1, which didn't
>>> help. I then increased the priority of ksoftirqd to 50 using chrt, which
>>> immediately solved the CPU usage problem on corosync.
>>>
>>> I tried a simple infinite loop program with realtime priority, but it
>>> didn't reproduce the problems with interrupts getting stuck.
>>>
>>>
>>> Three problems here:
>>> * all softirqs were stuck (not being processed) on CPU3, which included
>>> TIMER and SCHED. corosync relies quite heavily on timers, would lack of
>>> processing them cause the 100% CPU usage?
>>> * is there a kernel bug introduced between 4.4 - 4.19 that causes
>>> realtime tasks to not respect the 95% limit anymore? This would leave 5%
>>> time for IRQs, including NIC IRQs
>>> *  if corosync runs at higher priority than the kernel softirq thread
>>> processing NIC IRQ how is corosync expecting incoming packets to be
>>> processed, if it is hogging the CPU by receiving, polling and sending
>>> packets?
>>>
>>> On another host which exhibited the same problem I've run strace (which
>>> also had the side-effect of getting corosync unstuck from 100% CPU use
>>> after strace finished):
>>> 1 bind
>>> 5 close
>>> 688 epoll_wait
>>> 8 futex
>>> 1 getsockname
>>> 3 ioctl
>>> 1 open
>>> 3 recvfrom
>>> 190 recvmsg
>>> 87 sendmsg
>>> 9 sendto
>>> 4 socket
>>> 6 write
>>>
>>> On yet another host I've run gdb while corosync was stuck:
>>> Thread 2 (Thread 0x7f6fd0c9b700 (LWP 16245)):
>>> #0 0x00007f6fd34a0afb in do_futex_wait.constprop.1 ()
>>> from /lib64/libpthread.so.0
>>> #1 0x00007f6fd34a0b8f in __new_sem_wait_slow.constprop.0 ()
>>> from /lib64/libpthread.so.0
>>> #2 0x00007f6fd34a0c2b in sem_wait@@GLIBC_2.2.5 () from
>>> /lib64/libpthread.so.0
>>> #3 0x00007f6fd3b38991 in qb_logt_worker_thread () from /lib64/libqb.so.0
>>> #4 0x00007f6fd349ae25 in start_thread () from /lib64/libpthread.so.0
>>> #5 0x00007f6fd31c4bad in clone () from /lib64/libc.so.6
>>>
>>> Thread 1 (Thread 0x7f6fd43c7b80 (LWP 16242)):
>>> #0 0x00007f6fd31c5183 in epoll_wait () from /lib64/libc.so.6
>>> #1 0x00007f6fd3b3dea8 in poll_and_add_to_jobs () from /lib64/libqb.so.0
>>> #2 0x00007f6fd3b2ed93 in qb_loop_run () from /lib64/libqb.so.0
>>> #3 0x000055592d62ff78 in main ()
>>>
>>>
>>> [1]
>>> top - 15:51:38 up 47 min,  2 users,  load average: 3.81, 1.70, 0.70
>>> Tasks: 208 total,   4 running, 130 sleeping,   0 stopped,   0 zombie
>>> %Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
>>> 0.0 st
>>> %Cpu1  : 53.8 us, 46.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,
>>> 0.0 st
>>> %Cpu2  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
>>> 0.0 st
>>> %Cpu3  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
>>> 0.0 st
>>> %Cpu4  :  3.8 us,  0.0 sy,  0.0 ni, 96.2 id,  0.0 wa,  0.0 hi,  0.0 si,
>>> 0.0 st
>>> %Cpu5  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
>>> 0.0 st
>>> %Cpu6  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
>>> 0.0 st
>>> %Cpu7  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
>>> 0.0 st
>>> KiB Mem :  4210184 total,  3374752 free,   387380 used,   448052 buff/cache
>>> KiB Swap:  1048572 total,  1048572 free,        0 used.  3660276 avail Mem
>>>
>>>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>>> 16217 root      rt   0  194832  92532  66128 R 104.0  2.2   2:54.00 corosync
>>> 24004 root      20   0  161984   4432   3696 R   4.0  0.1   0:00.02 top
>>>
>>>  cat /proc/interrupts |grep eth
>>>  88:          2          0          0          0          0          0
>>>        0          0  xen-pirq    -msi-x     eth0
>>>  89:       2805          0          0     419621          0          0
>>>        0          0  xen-pirq    -msi-x     eth0-TxRx-0
>>>  90:          2          0          0          0          0          0
>>>        1          0  xen-pirq    -msi-x     eth1
>>>  91:        435     171086          0          0          0          0
>>>        0          0  xen-pirq    -msi-x     eth1-TxRx-0
>>>
>>> [2]
>>> watch cat /proc/softirqs
>>>                     CPU0       CPU1       CPU2       CPU3       CPU4
>>>   CPU5       CPU6       CPU7
>>>           HI:          1          0          0          0          0
>>>      0          0          0
>>>        TIMER:     355339     786951     367148     266583     389591
>>> 357184     369577     374880
>>>       NET_TX:          8       1081          1        743          1
>>>      1        223          5
>>>       NET_RX:       5807     230818       2709     529755       3323
>>>   2598       2550       1878
>>>        BLOCK:       8166         53         42         13         10
>>>   6563         21       6894
>>>     IRQ_POLL:          0          0          0          0          0
>>>      0          0          0
>>>      TASKLET:        103          6          9         85          0
>>>      0          0          0
>>>        SCHED:     240428     100842      83166      51213      77824
>>>  77270      72074      72261
>>>      HRTIMER:          0          0          0          0          0
>>>      0          0          0
>>>          RCU:     135491     245345     139513     130591     140280
>>> 135094     141007     136063
>>>
>>>
>>> [3]
>>> /proc/sys/kernel/sched_rt_runtime_us
>>> 950000
>>> /proc/sys/kernel/sched_rt_period_us
>>> 1000000
>>>
>>> Best regards,
>>> --Edwin
>>> _______________________________________________
>>> Users mailing list: Users at clusterlabs.org
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>