[ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

Edwin Török edvin.torok at citrix.com
Mon Feb 25 09:10:14 EST 2019


Hi,

I've done some more tests and I am now able to reproduce the 100% CPU
usage / infinite loop of corosync/libqb on all the kernels that I
tested. Therefore I think this is NOT a regression in the 4.19 kernel.

A suitable workaround for 4.19+ kernels is to run this on startup (e.g.
from tmpfiles.d):
echo NO_RT_RUNTIME_SHARE >/sys/kernel/debug/sched_features
This makes the 4.19 kernel behave similarly to earlier ones, where
high CPU usage did not prevent softirq processing.
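
For completeness, a tmpfiles.d snippet along these lines should apply the
same setting at boot (the file name is just an example, and this assumes
debugfs is already mounted by the time systemd-tmpfiles-setup runs):
# /etc/tmpfiles.d/corosync-sched.conf
w /sys/kernel/debug/sched_features - - - - NO_RT_RUNTIME_SHARE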

The difference between these kernels is whether the 100% CPU usage also
causes the host to fence or lose network connectivity: with 4.19.x it
does; with 4.4.x and 3.10.x it doesn't. For example, I am currently
looking at a 4.4.x kernel with 100% CPU usage on 8 out of 16 hosts,
which has been running like that for 2h30m now.

Kernels I've tried:
* 3.10.0-957.5.1.el7 (default CentOS 7.6)
* kernel-ml-4.20.12-1.el7.elrepo.x86_64 (elrepo)
* v4.20.10 (manual build with 'make -j32 binrpm-pkg', using the .config
from 4.20.10-100.fc28 with DEBUG_INFO disabled; with DEBUG_INFO enabled,
rpm on CentOS 7 failed to install the package)
* various attempted bisects:
kernel-4.19.16-200.fc28.x86_64.rpm
kernel-4.20.0-1.fc30.x86_64.rpm
kernel-4.20.5_00059_g7e66208-4.x86_64.rpm
kernel-4.20.5_00089_g44486b2-5.x86_64.rpm
kernel-4.20.8_00026_g1699a0c-7.x86_64.rpm
kernel-4.20.8_00039_gf76ee75-8.x86_64.rpm
kernel-4.20.8_00046_g196ffed-9.x86_64.rpm
kernel-4.20.8_00049_g54dbbb7-10.x86_64.rpm
kernel-4.20.9-11.x86_64.rpm
kernel-4.20.9_00001_g90aa9a7-12.x86_64.rpm
kernel-4.4.52-4.0.12.x86_64.rpm (XenServer Lima)
kernel-4.19.19-5.0.1.x86_64.rpm (XenServer master)

The updated repro steps are:

On 19/02/2019 16:26, Edwin Török wrote:
> On 18/02/2019 18:27, Edwin Török wrote:
> Setup: 16 CentOS 7.6 VMs, 4 vCPUs, 4GiB RAM running on XenServer 7.6
> (Xen 4.7.6)

With 2 vCPUs the lost network connectivity/fencing is a lot easier to
reproduce.
With 1 vCPU only the high CPU usage reproduces, at about 95% on all
kernels, with no fencing at all; after a while the CPU usage also drops
because corosync is able to make progress, so 1 vCPU is not a good
reproducer.

> Host is a Dell Poweredge R430, Xeon E5-2630 v3.

I used the exact same host.

>
> On each VM:
> # yum install -y corosync dlm pcs pacemaker fence-agents-all sbd
> # echo mypassword | passwd hacluster --stdin
> # systemctl enable --now pcsd
> # echo xen_wdt >/etc/modules-load.d/watchdog.conf
> # modprobe xen_wdt
> # hostnamectl set-hostname host-<ip-address>
>
> On one host:
> # pcs cluster auth -u hacluster -p xenroot <allips>
> # pcs cluster setup --name cluster --auto_tie_breaker=1 <allips>
>
> # pcs stonith sbd enable
> # pcs cluster enable --all
> # pcs cluster start --all
> # pcs property set no-quorum-policy=freeze
> # pcs resource create dlm ocf:pacemaker:controld op monitor interval=30s
> on-fail=fence clone interleave=true ordered=true
> # pcs property set stonith-watchdog-timeout=10s

I also installed spausedd, mentioned earlier; see below for details.

>
> In a loop on this host:
> # while true; do pcs cluster stop; pcs cluster start; corosync-cfgtool
> -R; done
>

This loop is not a good enough reproducer: it led me down some wrong
paths during git bisection, mistakenly showing a kernel as not
triggering the bug when in fact it did.
Sometimes the loop shows an increased CPU usage of ~50% on many nodes,
but this stops as soon as the loop is stopped. Sometimes the loop shows
no noticeable increase in CPU usage at all.

A better one is:
* On all hosts run:
pcs cluster stop --force
* Wait until they stop (you might have to `kill -9 $(pidof corosync)` if
any instance is already stuck)
* Install new kernel (yum install -y kernel, yum install -y kernel-ml,
rpm -Uhv --force kernel-..., etc.)
* rm -f /boot/grub2/grubenv
* sync
* reboot
* Run corosync-quorumtool and crm_mon to check node health
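
For reference, the per-host part of the above collected into a rough
script (the kernel package here is just an example; adjust to whichever
kernel is being tested):

# run on every host
pcs cluster stop --force
# if corosync is already stuck, kill it by hand:
#   kill -9 $(pidof corosync)
yum install -y kernel      # or kernel-ml, or rpm -Uhv --force kernel-...
rm -f /boot/grub2/grubenv
sync
reboot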

When all nodes have booted and have corosync running, you usually end up
in one of these scenarios:
1. The node takes a long time to boot before you get an SSH or vga/serial
console shell. Wait; the boot scripts have to time out, which is likely
related to corosync failing to start up.
2. All nodes are quorate, and crm_mon shows dlm active on all hosts. The
bug failed to reproduce; try again with a reboot (either a hard reboot,
or a clean reboot of all hosts in parallel). It is quite rare to end up
in this situation.
3. Corosync and crmd are both using a lot of CPU. Wait; crmd usually
finishes and you end up in one of the other scenarios.
4. Corosync is using 50% of a core. Wait; it usually either finishes or
goes to 100%.
5. Corosync is not running, and `pcs cluster start` fails due to 'Denied
connection, not ready'. Try 'pcs cluster start' again in a loop (see the
one-liner after this list) until you reach one of the other scenarios.
6. Corosync is running with high CPU usage (100%) on some nodes. If it
stays like that for several minutes the bug was reproduced (whether the
node fences or not)
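
For scenario 5, a simple retry loop is enough (just a sketch, assuming
pcs exits non-zero while it keeps refusing the connection):
until pcs cluster start; do sleep 5; done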

On some kernels the problem reproduces very easily after a hard reboot of
all hosts. On other kernels it takes a few more hard (power cycle the
VM) or soft (run the 'reboot' command inside the VM) cycles to reproduce.
A soft reboot seems to have a higher chance of reproducing the problem.

spausedd:

The pauses it logs are not always correlated with the high CPU usage.
For example on one node it hasn't logged anything today:
/var/log/messages-20190224:Feb 20 17:03:17 host-10 spausedd[4239]: Not scheduled for 0.2225s (threshold is 0.2000s), steal time is 0.0000s (0.00%)
/var/log/messages-20190224:Feb 20 17:03:19 host-10 spausedd[4239]: Not scheduled for 0.2416s (threshold is 0.2000s), steal time is 0.0000s (0.00%)

But:
4399 root      rt   0  241380 139072  84732 R 106.7  3.4 183:51.67 corosync

On other nodes it does log problems, but only once in a while; for
example, these are all the logs from today on another node:
/var/log/messages:Feb 25 10:07:45 host-10 spausedd[3707]: Not scheduled for 0.4692s (threshold is 0.2000s), steal time is 0.2800s (59.68%)
/var/log/messages:Feb 25 10:07:45 host-10 spausedd[3707]: Steal time is > 10.0%, this is usually because of overloaded host machine
/var/log/messages:Feb 25 10:16:34 host-10 spausedd[3707]: Not scheduled for 0.4903s (threshold is 0.2000s), steal time is 0.3400s (69.34%)
/var/log/messages:Feb 25 10:16:34 host-10 spausedd[3707]: Steal time is > 10.0%, this is usually because of overloaded host machine
/var/log/messages:Feb 25 10:42:21 host-10 spausedd[3720]: Not scheduled for 0.2840s (threshold is 0.2000s), steal time is 0.1600s (56.34%)
/var/log/messages:Feb 25 10:42:21 host-10 spausedd[3720]: Steal time is > 10.0%, this is usually because of overloaded host machine
/var/log/messages:Feb 25 10:59:00 host-10 spausedd[3729]: Not scheduled for 0.4808s (threshold is 0.2000s), steal time is 0.3700s (76.95%)
/var/log/messages:Feb 25 10:59:00 host-10 spausedd[3729]: Steal time is > 10.0%, this is usually because of overloaded host machine
/var/log/messages:Feb 25 11:07:53 host-10 spausedd[3729]: Not scheduled for 0.2448s (threshold is 0.2000s), steal time is 0.1300s (53.10%)
/var/log/messages:Feb 25 11:07:53 host-10 spausedd[3729]: Steal time is > 10.0%, this is usually because of overloaded host machine
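
For reference, these excerpts are plain grep output, collected with
something like the following (including rotated logs):
grep spausedd /var/log/messages /var/log/messages-*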

Best regards,
--Edwin


