[ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?
Klaus Wenninger
kwenning at redhat.com
Mon Feb 25 14:59:56 UTC 2019
On 02/25/2019 03:10 PM, Edwin Török wrote:
> Hi,
>
> I've done some more tests and I am now able to reproduce the 100% CPU
> usage / infinite loop of corosync/libqb on all the kernels that I
> tested. Therefore I think this is NOT a regression in 4.19 kernel.
>
> A suitable workaround for 4.19+ kernels is to run this on startup (e.g.
> from tmpfiles.d):
> echo NO_RT_RUNTIME_SHARE >/sys/kernel/debug/sched_features
> This makes the 4.19 kernel behave similarly to earlier ones, where
> high CPU usage did not prevent softirq processing.
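For reference, a minimal sketch of that tmpfiles.d approach; a 'w' line
simply writes its argument into the target file at boot. The file name is
arbitrary, and this assumes debugfs is already mounted at /sys/kernel/debug
by the time systemd-tmpfiles-setup.service runs:

# cat >/etc/tmpfiles.d/sched_features.conf <<'EOF'
w /sys/kernel/debug/sched_features - - - - NO_RT_RUNTIME_SHARE
EOF
# systemd-tmpfiles --create sched_features.conf   # apply immediately, without a reboot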
>
> The difference between these kernels is whether the 100% CPU usage also
> causes the host to fence or lose network connectivity: with 4.19.x it
> does; with 4.4.x and 3.10.x it doesn't. For example, I am currently
> looking at a 4.4.x kernel with 100% CPU usage on 8 out of 16 hosts,
> running like that for 2h30m now.
>
> Kernels I've tried:
> * 3.10.0-957.5.1.el7 (default CentOS 7.6)
> * kernel-ml-4.20.12-1.el7.elrepo.x86_64 (elrepo)
> * v4.20.10 (manual build with 'make -j32 binrpm-pkg' with the .config from
> 4.20.10-100.fc28 and DEBUG_INFO disabled. With DEBUG_INFO enabled, rpm on
> CentOS 7 failed to install the package)
> * various attempted bisects:
> kernel-4.19.16-200.fc28.x86_64.rpm
> kernel-4.20.0-1.fc30.x86_64.rpm
> kernel-4.20.5_00059_g7e66208-4.x86_64.rpm
> kernel-4.20.5_00089_g44486b2-5.x86_64.rpm
> kernel-4.20.8_00026_g1699a0c-7.x86_64.rpm
> kernel-4.20.8_00039_gf76ee75-8.x86_64.rpm
> kernel-4.20.8_00046_g196ffed-9.x86_64.rpm
> kernel-4.20.8_00049_g54dbbb7-10.x86_64.rpm
> kernel-4.20.9-11.x86_64.rpm
> kernel-4.20.9_00001_g90aa9a7-12.x86_64.rpm
> kernel-4.4.52-4.0.12.x86_64.rpm (XenServer Lima)
> kernel-4.19.19-5.0.1.x86_64.rpm (XenServer master)
>
> The updated repro steps are:
>
> On 19/02/2019 16:26, Edwin Török wrote:
>> On 18/02/2019 18:27, Edwin Török wrote:
>> Setup: 16 CentOS 7.6 VMs, 4 vCPUs, 4GiB RAM running on XenServer 7.6
>> (Xen 4.7.6)
> 2 vCPUs make it a lot easier to reproduce the lost network
> connectivity/fencing.
> 1 vCPU reproduces only the high CPU usage (around 95% on all
> kernels) with no fencing at all; after a while the CPU usage drops
> because it is able to make progress, so 1 vCPU is not a good
> reproducer.
That actually all fits the theory quite well.
The lower the number of cores, the higher the chance that the
thread running wild hits the core that has certain vital stuff
bound to it.
If you have RT_RUNTIME_SHARE enabled - as I understand it - the
5% reservation isn't done on a per-core basis anymore
but globally, so it is possible for a single core to get hogged to 100%.
If you are down to a single core it of course doesn't make any
difference anymore whether RT_RUNTIME_SHARE is enabled or not.
With all this in mind, the best reproducer might then be
a single core with the non-RT reservation disabled.
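For anyone who wants to check those knobs on their own boxes, the interfaces
involved are the standard scheduler sysctls plus the sched_features debugfs
file; the 5% figure is just the default sched_rt_runtime_us /
sched_rt_period_us ratio:

# cat /proc/sys/kernel/sched_rt_period_us    # 1000000 us by default
# cat /proc/sys/kernel/sched_rt_runtime_us   # 950000 us, i.e. RT tasks get at most 95% of each period
# tr ' ' '\n' </sys/kernel/debug/sched_features | grep RT_RUNTIME_SHARE

The last command prints either RT_RUNTIME_SHARE or NO_RT_RUNTIME_SHARE,
depending on which way the feature is currently set.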
>
>> Host is a Dell PowerEdge R430, Xeon E5-2630 v3.
> Used exact same host.
>
>> On each VM:
>> # yum install -y corosync dlm pcs pacemaker fence-agents-all sbd
>> # echo mypassword | passwd hacluster --stdin
>> # systemctl enable --now pcsd
>> # echo xen_wdt >/etc/modules-load.d/watchdog.conf
>> # modprobe xen_wdt
>> # hostnamectl set-hostname host-<ip-address>
>>
>> On one host:
>> # pcs cluster auth -u hacluster -p mypassword <allips>
>> # pcs cluster setup --name cluster --auto_tie_breaker=1 <allips>
>>
>> # pcs stonith sbd enable
>> # pcs cluster enable --all
>> # pcs cluster start --all
>> # pcs property set no-quorum-policy=freeze
>> # pcs resource create dlm ocf:pacemaker:controld op monitor interval=30s
>> on-fail=fence clone interleave=true ordered=true
>> # pcs property set stonith-watchdog-timeout=10s
> I also installed spausedd mentioned earlier, see below for details.
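A quick sanity check on each node that the pieces above are in place might
look like this (chrt ships with util-linux, the rest with the packages
already installed; nothing here is specific to this reproducer):

# chrt -p $(pidof corosync)   # should report SCHED_RR and the RT priority
# sbd query-watchdog          # lists the watchdog device(s) sbd can use
# corosync-quorumtool -s      # membership and quorum state
# crm_mon -1                  # one-shot cluster status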
>
>> In a loop on this host:
>> # while true; do pcs cluster stop; pcs cluster start; corosync-cfgtool
>> -R; done
>>
> This loop is not a good enough reproducer, it led me down some wrong
> paths during git bisection, mistakenly showing a kernel as not
> triggering the bug when in fact it was.
> Sometimes the loop shows an increased CPU usage of ~50% on many nodes,
> but when the loop is stopped this stops too. Sometimes the loop shows no
> noticeable increase in CPU usage.
>
> A better one is (a rough consolidation of these steps follows the list):
> * On all hosts run:
> pcs cluster stop --force
> * Wait until they stop (might have to `kill -9 $(pidof corosync)` if any
> is already stuck)
> * Install new kernel (yum install -y kernel, yum install -y kernel-ml,
> rpm -Uhv --force kernel-..., etc.)
> * rm -f /boot/grub2/grubenv
> * sync
> * reboot
> * Run corosync-quorumtool and crm_mon to check node health
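A rough per-host consolidation of the above (the kernel install line is
whichever variant is being tested, and the kill fallback from the second
step is collapsed into one line):

# pcs cluster stop --force || kill -9 $(pidof corosync)
# yum install -y kernel      # or kernel-ml, or rpm -Uhv --force kernel-...
# rm -f /boot/grub2/grubenv
# sync
# reboot

and once the node is back up:

# corosync-quorumtool -s
# crm_mon -1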
>
> When all nodes have booted and have corosync running you usually end up
> in one of these scenarios:
> 1. The node takes a long time to boot until you get an SSH or vga/serial
> console shell. Wait; the boot scripts have to time out, likely related to
> corosync failing to start up.
> 2. All nodes are quorate, and crm_mon shows dlm active on all hosts. The bug
> failed to reproduce; try again with a reboot (either a hard reboot, or a
> clean reboot of all hosts in parallel). It is quite rare to end up in
> this situation.
> 3. Corosync and crmd are both using a lot of CPU. Wait; crmd usually
> finishes and you end up in one of the other scenarios.
> 4. Corosync is using 50% of a core. Wait; it usually either finishes or
> goes to 100%.
> 5. Corosync is not running, and `pcs cluster start` fails due to 'Denied
> connection, not ready'. Try 'pcs cluster start' again in a loop until
> you reach one of the other scenarios.
> 6. Corosync is running with high CPU usage (100%) on some nodes. If it
> stays like that for several minutes the bug was reproduced, whether the
> node fences or not (a simple watch loop for this is sketched below).
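To spot scenario 6 without staring at top, a trivial watch loop is enough
(this assumes a single corosync process per node):

# while sleep 10; do ps -o pid=,%cpu=,etime= -p $(pidof corosync); done

If the %cpu column stays pinned near 100 for several minutes, the bug has
reproduced.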
>
> On some kernels after a hard reboot of all hosts the problem reproduces
> very easily. On other kernels it takes a few more hard (power cycle the
> VM), or soft (run 'reboot' command inside the VM) cycles to reproduce.
> The soft reboot seems to have a higher chance of reproducing the problem.
>
> spausedd:
>
> The pauses it logs are not always correlated with the high CPU usage.
> For example on one node it hasn't logged anything today:
> /var/log/messages-20190224:Feb 20 17:03:17 host-10 spausedd[4239]: Not
> scheduled for 0.2225s (threshold is 0.2000s), steal time is 0.0000s (0.00%)
> /var/log/messages-20190224:Feb 20 17:03:19 host-10 spausedd[4239]: Not
> scheduled for 0.2416s (threshold is 0.2000s), steal time is 0.0000s (0.00%)
>
> But:
>  PID  USER  PR  NI    VIRT    RES   SHR  S  %CPU %MEM     TIME+ COMMAND
> 4399  root  rt   0  241380 139072 84732  R 106.7  3.4 183:51.67 corosync
>
> On other nodes it does log problems, but only every once in a while; for
> example, these are all of today's logs from this other node:
> /var/log/messages:Feb 25 10:07:45 host-10 spausedd[3707]: Not scheduled for 0.4692s (threshold is 0.2000s), steal time is 0.2800s (59.68%)
> /var/log/messages:Feb 25 10:07:45 host-10 spausedd[3707]: Steal time is > 10.0%, this is usually because of overloaded host machine
> /var/log/messages:Feb 25 10:16:34 host-10 spausedd[3707]: Not scheduled for 0.4903s (threshold is 0.2000s), steal time is 0.3400s (69.34%)
> /var/log/messages:Feb 25 10:16:34 host-10 spausedd[3707]: Steal time is > 10.0%, this is usually because of overloaded host machine
> /var/log/messages:Feb 25 10:42:21 host-10 spausedd[3720]: Not scheduled for 0.2840s (threshold is 0.2000s), steal time is 0.1600s (56.34%)
> /var/log/messages:Feb 25 10:42:21 host-10 spausedd[3720]: Steal time is > 10.0%, this is usually because of overloaded host machine
> /var/log/messages:Feb 25 10:59:00 host-10 spausedd[3729]: Not scheduled for 0.4808s (threshold is 0.2000s), steal time is 0.3700s (76.95%)
> /var/log/messages:Feb 25 10:59:00 host-10 spausedd[3729]: Steal time is > 10.0%, this is usually because of overloaded host machine
> /var/log/messages:Feb 25 11:07:53 host-10 spausedd[3729]: Not scheduled for 0.2448s (threshold is 0.2000s), steal time is 0.1300s (53.10%)
> /var/log/messages:Feb 25 11:07:53 host-10 spausedd[3729]: Steal time is > 10.0%, this is usually because of overloaded host machine
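Side note: pulling every spausedd complaint out of the rotated logs in one
go is just a grep away, which makes it easier to line the timestamps up
against the corosync CPU usage:

# grep -h 'Not scheduled' /var/log/messages*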
> Best regards,
> --Edwin