[ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?
Jan Friesse
jfriesse at redhat.com
Tue Feb 26 07:32:01 UTC 2019
Edwin
> Hi,
>
> I've done some more tests and I am now able to reproduce the 100% CPU
> usage / infinite loop of corosync/libqb on all the kernels that I
> tested. Therefore I think this is NOT a regression in the 4.19 kernel.
That's good
>
> A suitable workaround for 4.19+ kernels is to run this on startup (e.g.
> from tmpfiles.d):
> echo NO_RT_RUNTIME_SHARE >/sys/kernel/debug/sched_features
> This makes the 4.19 kernel behave similarly to earlier ones, where
> high CPU usage did not prevent softirq processing.
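For reference, that workaround can be expressed as a tmpfiles.d entry roughly
like the following (untested on my side; it also assumes debugfs is already
mounted when systemd-tmpfiles runs), e.g. in a hypothetical
/etc/tmpfiles.d/sched_features.conf:

  w /sys/kernel/debug/sched_features - - - - NO_RT_RUNTIME_SHARE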
>
> The difference between these kernels is whether the 100% CPU usage also
> causes the host to fence or lose network connectivity: with 4.19.x it
> does, with 4.4.x and 3.10.x it doesn't. For example, I am currently
> looking at a 4.4.x kernel with 100% CPU usage on 8 out of 16 hosts,
> which has been going on for 2h30m now.
>
> Kernels I've tried:
> * 3.10.0-957.5.1.el7 (default CentOS 7.6)
> * kernel-ml-4.20.12-1.el7.elrepo.x86_64 (elrepo)
> * v4.20.10 (manual build with 'make -j32 binrpm-pkg' with the .config from
> 4.20.10-100.fc28 and DEBUG_INFO disabled; with DEBUG_INFO enabled, rpm on
> CentOS 7 failed to install the package)
> * various attempted bisects:
> kernel-4.19.16-200.fc28.x86_64.rpm
> kernel-4.20.0-1.fc30.x86_64.rpm
> kernel-4.20.5_00059_g7e66208-4.x86_64.rpm
> kernel-4.20.5_00089_g44486b2-5.x86_64.rpm
> kernel-4.20.8_00026_g1699a0c-7.x86_64.rpm
> kernel-4.20.8_00039_gf76ee75-8.x86_64.rpm
> kernel-4.20.8_00046_g196ffed-9.x86_64.rpm
> kernel-4.20.8_00049_g54dbbb7-10.x86_64.rpm
> kernel-4.20.9-11.x86_64.rpm
> kernel-4.20.9_00001_g90aa9a7-12.x86_64.rpm
> kernel-4.4.52-4.0.12.x86_64.rpm (XenServer Lima)
> kernel-4.19.19-5.0.1.x86_64.rpm (XenServer master)
>
> The updated repro steps are:
>
> On 19/02/2019 16:26, Edwin Török wrote:
>> On 18/02/2019 18:27, Edwin Török wrote:
>> Setup: 16 CentOS 7.6 VMs, 4 vCPUs, 4GiB RAM running on XenServer 7.6
>> (Xen 4.7.6)
>
> 2 vCPUs make it a lot easier to reproduce the lost network
> connectivity/fencing.
> 1 vCPU reproduces only the high CPU usage, at about 95% on all kernels,
> with no fencing at all; also, after a while the CPU usage drops because
> it is able to make progress, so 1 vCPU is not a good reproducer.
>
>> Host is a Dell Poweredge R430, Xeon E5-2630 v3.
So do I understand correctly that the setup is:
- 10 HW cores
- 16 VMs, each with 2 (previously 4) vCPUs -> 32 (previously 64) vCPUs?
How much physical memory does the host have?
Is the problem reproducible when the host is not so heavily overcommitted
(let's say 5-10 VMs, each with 2 vCPUs)?
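If it helps, a rough way to watch how starved a guest is, independently of
spausedd, is to sample the steal counter (the 8th value after "cpu" in
/proc/stat, in USER_HZ ticks) and see how quickly it grows while the test
runs, e.g.:

  awk '/^cpu /{print "steal ticks:", $9}' /proc/stat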
>
> Used the exact same host.
>
>>
>> On each VM:
>> # yum install -y corosync dlm pcs pacemaker fence-agents-all sbd
>> # echo mypassword | passwd hacluster --stdin
>> # systemctl enable --now pcsd
>> # echo xen_wdt >/etc/modules-load.d/watchdog.conf
>> # modprobe xen_wdt
>> # hostnamectl set-hostname host-<ip-address>
>>
>> On one host:
>> # pcs cluster auth -u hacluster -p xenroot <allips>
>> # pcs cluster setup --name cluster --auto_tie_breaker=1 <allips>
>>
>> # pcs stonith sbd enable
>> # pcs cluster enable --all
>> # pcs cluster start --all
>> # pcs property set no-quorum-policy=freeze
>> # pcs resource create dlm ocf:pacemaker:controld op monitor interval=30s
>> on-fail=fence clone interleave=true ordered=true
>> # pcs property set stonith-watchdog-timeout=10s
>
> I also installed spausedd, mentioned earlier; see below for details.
>
>>
>> In a loop on this host:
>> # while true; do pcs cluster stop; pcs cluster start; corosync-cfgtool
>> -R; done
>>
>
> This loop is not a good enough reproducer; it led me down some wrong
> paths during git bisection, mistakenly showing a kernel as not
> triggering the bug when in fact it was.
> Sometimes the loop shows an increased CPU usage of ~50% on many nodes,
> but when the loop is stopped this stops too. Sometimes the loop shows no
> noticeable increase in CPU usage.
>
> A better one is:
> * On all hosts run:
> pcs cluster stop --force
> * Wait until they stop (you might have to `kill -9 $(pidof corosync)` if
> any is already stuck)
> * Install the new kernel (yum install -y kernel, yum install -y kernel-ml,
> rpm -Uhv --force kernel-..., etc.)
> * rm -f /boot/grub2/grubenv
> * sync
> * reboot
> * Run corosync-quorumtool and crm_mon to check node health
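For reference, the per-node part of that sequence could be scripted roughly
like this (just a sketch; the kernel package name is a placeholder):

  pcs cluster stop --force
  pidof corosync && kill -9 $(pidof corosync)   # only if corosync is stuck
  yum install -y <kernel-package>
  rm -f /boot/grub2/grubenv
  sync
  reboot
  # after the node is back up:
  corosync-quorumtool
  crm_mon -1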
>
> When all nodes have booted and have corosync running, you usually end up
> in one of these scenarios:
> 1. The node takes a long time to boot before you get an SSH or VGA/serial
> console shell. Wait; the boot scripts have to time out, likely because
> corosync is failing to start up.
> 2. All nodes are quorate, and crm_mon shows dlm active on all hosts. The
> bug failed to reproduce; try again with a reboot (either a hard reboot,
> or a clean reboot of all hosts in parallel). It is quite rare to end up
> in this situation.
> 3. Corosync and crmd are both using a lot of CPU. Wait; crmd usually
> finishes and you end up in one of the other scenarios.
> 4. Corosync is using 50% of a core. Wait; it usually either finishes or
> goes to 100%.
> 5. Corosync is not running, and `pcs cluster start` fails with 'Denied
> connection, not ready'. Try 'pcs cluster start' again in a loop until
> you reach one of the other scenarios.
> 6. Corosync is running with high CPU usage (100%) on some nodes. If it
> stays like that for several minutes, the bug was reproduced (whether the
> node fences or not).
>
> On some kernels the problem reproduces very easily after a hard reboot of
> all hosts. On other kernels it takes a few more hard (power-cycle the VM)
> or soft (run the 'reboot' command inside the VM) reboot cycles to
> reproduce. A soft reboot seems to have a higher chance of reproducing the
> problem.
Have you had a chance to test with "send_join" set to 200 in
corosync.conf? It may really help, because the problem you are
describing really looks like the result of all nodes starting at the same
time and overloading the network at exactly the same time.
We already have a bug for bigger clusters (32 nodes) where setting
send_join helps, and it is highly probably going to become the default (I
need to test it and find a "formula" to compute the value based on the
token timeout).
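If you want to try it, the option belongs in the totem section of
/etc/corosync/corosync.conf on every node, roughly like this (the
surrounding values are only an illustration of where the option goes, not a
recommendation), followed by a restart of corosync:

  totem {
      version: 2
      cluster_name: cluster
      # upper bound (in milliseconds) of the random delay applied before
      # sending a join message, so nodes that start at the same time do
      # not all transmit at once
      send_join: 200
      ...
  }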
>
> spausedd:
>
> The pauses it logs are not always correlated with the high CPU usage.
> For example, on one node it hasn't logged anything today:
> /var/log/messages-20190224:Feb 20 17:03:17 host-10 spausedd[4239]: Not scheduled for 0.2225s (threshold is 0.2000s), steal time is 0.0000s (0.00%)
> /var/log/messages-20190224:Feb 20 17:03:19 host-10 spausedd[4239]: Not scheduled for 0.2416s (threshold is 0.2000s), steal time is 0.0000s (0.00%)
>
> But:
> 4399 root rt 0 241380 139072 84732 R 106.7 3.4 183:51.67 corosync
>
High corosync CPU usage doesn't necessarily mean that spausedd doesn't
get its required portion (a very small one) of time. You can check
the spausedd source code yourself
(https://github.com/jfriesse/spausedd/blob/master/spausedd.c) - it's
short (especially if you ignore the HAVE_VMGUESTLIB parts) and I'm pretty
sure it works as it should. (It was one of the reasons why I wrote it:
the corosync pause detector may be affected by bugs in various components
(a cycle in corosync, a bug in libqb), but spausedd shouldn't be.)
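Just to illustrate the principle (this is NOT the real implementation, only
a very rough shell approximation): sleep for a fixed interval and complain
if the wakeup comes much later than expected, something like

  interval_ms=100
  threshold_ms=200
  while :; do
      start=$(date +%s%N)
      sleep 0.1
      end=$(date +%s%N)
      elapsed_ms=$(( (end - start) / 1000000 ))
      if [ "$elapsed_ms" -gt $(( interval_ms + threshold_ms )) ]; then
          echo "not scheduled for ${elapsed_ms}ms (expected ~${interval_ms}ms)"
      fi
  done

The real spausedd of course does more (it also reports steal time, as in
your logs), but the point is that it does not depend on corosync or libqb
at all.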
> On other nodes it does log problems, but only once in a while; for
> example, these are all of today's logs from this other node:
> /var/log/messages:Feb 25 10:07:45 host-10 spausedd[3707]: Not scheduled for 0.4692s (threshold is 0.2000s), steal time is 0.2800s (59.68%)
> /var/log/messages:Feb 25 10:07:45 host-10 spausedd[3707]: Steal time is > 10.0%, this is usually because of overloaded host machine
>
> /var/log/messages:Feb 25 10:16:34 host-10 spausedd[3707]: Not scheduled for 0.4903s (threshold is 0.2000s), steal time is 0.3400s (69.34%)
> /var/log/messages:Feb 25 10:16:34 host-10 spausedd[3707]: Steal time is > 10.0%, this is usually because of overloaded host machine
>
> /var/log/messages:Feb 25 10:42:21 host-10 spausedd[3720]: Not scheduled for 0.2840s (threshold is 0.2000s), steal time is 0.1600s (56.34%)
> /var/log/messages:Feb 25 10:42:21 host-10 spausedd[3720]: Steal time is > 10.0%, this is usually because of overloaded host machine
>
> /var/log/messages:Feb 25 10:59:00 host-10 spausedd[3729]: Not scheduled for 0.4808s (threshold is 0.2000s), steal time is 0.3700s (76.95%)
> /var/log/messages:Feb 25 10:59:00 host-10 spausedd[3729]: Steal time is > 10.0%, this is usually because of overloaded host machine
>
> /var/log/messages:Feb 25 11:07:53 host-10 spausedd[3729]: Not scheduled for 0.2448s (threshold is 0.2000s), steal time is 0.1300s (53.10%)
> /var/log/messages:Feb 25 11:07:53 host-10 spausedd[3729]: Steal time is > 10.0%, this is usually because of overloaded host machine
>
As stated in the comment above, could you please try to reproduce the
problem on a host that is not so heavily overloaded?
Honza
> Best regards,
> --Edwin
>