[ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

Edwin Török edvin.torok at citrix.com
Tue Feb 26 05:01:39 EST 2019


On 26/02/2019 07:32, Jan Friesse wrote:
> Edwin
>> Török wrote:
>>> Setup: 16 CentOS 7.6 VMs, 4 vCPUs, 4GiB RAM running on XenServer 7.6
>>> (Xen 4.7.6)
>>
>> 2 vCPUs makes it a lot easier to reproduce the lost network
>> connectivity/fencing.
>> 1 vCPU reproduces only the high CPU usage (around 95% on all kernels),
>> with no fencing at all; after a while the CPU usage also drops because
>> corosync is able to make progress, so 1 vCPU is not a good reproducer.
>>
>>> Host is a Dell Poweredge R430, Xeon E5-2630 v3.
> 
> So do I understand correctly that setup is:
> - 10 HW cores

32 logical CPUs in total on physical host, albeit shared among all the
VMs and Dom0.

> - 16 VMs each with 2 (previously 4) vcpu -> 32 (previously 64) vcpu

Yes

> ?
> 
> How much memory (physical memory) hosts has?
> 

96 GiB in total; 64 GiB of that is used by these VMs and 25.6 GiB is
free (the rest is used by Dom0 + Xen).

> Is the problem reproducible when hosts is not that much overcommited
> (let's say 5-10 VMs each with 2 vcpus)?

I can try splitting the load across 2 or 4 identical physical hosts and
will report back.

> 
> 
>>
>> Used exact same host.
>> [...]
>>
>> On some kernels after a hard reboot of all hosts the problem reproduces
>> very easily. On other kernels it takes a few more hard (power cycle the
>> VM), or soft (run 'reboot' command inside the VM) cycles to reproduce.
>> The soft reboot seems to have a higher chance of reproducing the problem.
> 
> Have you had a chance to test with "send_join" set to 200 in
> corosync.conf? It may really help, because the problem you are
> describing looks very much like the result of all nodes starting at the
> same time and overloading the network at exactly the same moment.
> 
> We already have a bug for bigger clusters (32 nodes) where setting
> send_join helps, and it's very likely to become the default (I need to
> test it and find a "formula" to compute the value based on the token
> timeout).

Not while reproducing this bug; I tried setting it to 50 in the past and
it didn't help.
I'll try again with 200, thanks for the suggestion.
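
For reference, the change I plan to test would look roughly like the
snippet below. As I read corosync.conf(5), send_join lives in the totem
section and is the upper bound (in milliseconds) of a random delay
before sending a join message; the 200 is just Jan's suggested value and
everything else stays as we have it today:

  totem {
          version: 2
          # existing token/transport settings left unchanged
          # upper bound (ms) of the random delay before sending a join
          # message, per Jan's suggestion
          send_join: 200
  }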

> 
>>
>> spausedd:
>>
>> The pauses it logs are not always correlated with the high CPU usage.
>> For example on one node it hasn't logged anything today:
>> /var/log/messages-20190224:Feb 20 17:03:17 host-10 spausedd[4239]: Not
>> scheduled for 0.2225s (threshold is 0.2000s), steal time is 0.0000s
>> (0.00%)
>> /var/log/messages-20190224:Feb 20 17:03:19 host-10 spausedd[4239]: Not
>> scheduled for 0.2416s (threshold is 0.2000s), steal time is 0.0000s
>> (0.00%)
>>
>> But:
>> 4399 root      rt   0  241380 139072  84732 R 106.7  3.4 183:51.67
>> corosync
>>
> 
> High corosync CPU usage doesn't necessarily mean that spausedd doesn't
> get its required portion (a very small one) of time. You can check the
> spausedd source code yourself
> (https://github.com/jfriesse/spausedd/blob/master/spausedd.c) - it's
> short (especially if you ignore the HAVE_VMGUESTLIB parts) and I'm
> pretty sure it works as it should. (That was one of the reasons why I
> wrote it: the corosync pause detector may be affected by bugs in
> various components (a cycle in corosync, a bug in libqb), but spausedd
> shouldn't be.)

I wasn't suggesting that spausedd is wrong, just that with 2 vCPUs the
kernel can most likely schedule it on the other vCPU almost every time.
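
My understanding of what it measures is roughly the loop below. This is
only a simplified sketch of the idea, not the actual spausedd code: it
leaves out the steal-time accounting, the interval is arbitrary, and the
200 ms threshold just mirrors the log lines above.

  /* Simplified sketch of the pause-detection idea only, not the real
   * spausedd code: sleep for a fixed interval and warn if we were gone
   * for much longer than that (steal time omitted). */
  #include <stdio.h>
  #include <time.h>

  int main(void)
  {
          const double threshold = 0.2;                 /* 200 ms, as in the logs */
          struct timespec sleep_ts = { 0, 100000000L }; /* check every 100 ms */
          struct timespec before, after;

          for (;;) {
                  clock_gettime(CLOCK_MONOTONIC, &before);
                  nanosleep(&sleep_ts, NULL);
                  clock_gettime(CLOCK_MONOTONIC, &after);

                  double elapsed = (after.tv_sec - before.tv_sec)
                                 + (after.tv_nsec - before.tv_nsec) / 1e9;
                  if (elapsed > threshold)
                          printf("Not scheduled for %.4fs (threshold is %.4fs)\n",
                                 elapsed, threshold);
          }
  }
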
I'll try Klaus's suggestion of increasing the RT runtime share to 100%,
together with 1 vCPU; that should make spausedd detection more likely.
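
For the archives, my reading of that suggestion is to lift the RT
throttling limit, i.e. something along these lines (by default RT tasks
are throttled to 950000 us out of every 1000000 us period, i.e. 95%):

  # show the current period and runtime limits
  sysctl kernel.sched_rt_period_us kernel.sched_rt_runtime_us
  # give RT tasks 100% by disabling the throttling entirely
  sysctl -w kernel.sched_rt_runtime_us=-1
  # or, equivalently, make the runtime equal to the period
  sysctl -w kernel.sched_rt_runtime_us=1000000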

Best regards,
--Edwin
