[ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?
Edwin Török
edvin.torok at citrix.com
Wed Feb 20 12:37:19 EST 2019
On 20/02/2019 13:08, Jan Friesse wrote:
> Edwin Török wrote:
>> On 20/02/2019 07:57, Jan Friesse wrote:
>>> Edwin,
>>>>
>>>>
>>>> On 19/02/2019 17:02, Klaus Wenninger wrote:
>>>>> On 02/19/2019 05:41 PM, Edwin Török wrote:
>>>>>> On 19/02/2019 16:26, Edwin Török wrote:
>>>>>>> On 18/02/2019 18:27, Edwin Török wrote:
>>>>>>>> Did a test today with CentOS 7.6 with upstream kernel and with
>>>>>>>> 4.20.10-1.el7.elrepo.x86_64 (tested both with upstream SBD, and our
>>>>>>>> patched [1] SBD) and was not able to reproduce the issue yet.
>>>>>>> I was finally able to reproduce this using only upstream components
>>>>>>> (it seems to be easier to reproduce with our patched SBD, but I was
>>>>>>> able to reproduce it using only upstream packages unpatched by us):
>>>>>
>>>>> Just out of curiosity: What did you patch in SBD?
>>>>> Sorry if I missed the answer in the previous communication.
>>>>
>>>> It is mostly this PR, which calls getquorate quite often (a more
>>>> efficient impl. would be to use the quorum notification API like
>>>> dlm/pacemaker do, although see concerns in
>>>> https://lists.clusterlabs.org/pipermail/users/2019-February/016249.html):
>>>>
>>>> https://github.com/ClusterLabs/sbd/pull/27
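(For reference, a minimal sketch of the notification-style approach mentioned
above, assuming corosync 2.x's libquorum C API and building with -lquorum;
this is only an illustration, not SBD's or pacemaker's actual code.)

/* Register for quorum change callbacks instead of polling
 * quorum_getquorate(); assumes corosync 2.x libquorum. */
#include <stdio.h>
#include <corosync/corotypes.h>
#include <corosync/quorum.h>

static void quorum_notify(quorum_handle_t handle, uint32_t quorate,
                          uint64_t ring_seq, uint32_t view_list_entries,
                          uint32_t *view_list)
{
        printf("quorate=%u ring_seq=%llu members=%u\n", quorate,
               (unsigned long long)ring_seq, view_list_entries);
}

int main(void)
{
        quorum_callbacks_t callbacks = { .quorum_notify_fn = quorum_notify };
        quorum_handle_t handle;
        uint32_t quorum_type;

        if (quorum_initialize(&handle, &callbacks, &quorum_type) != CS_OK)
                return 1;
        if (quorum_trackstart(handle, CS_TRACK_CHANGES) != CS_OK)
                return 1;
        /* A real daemon would quorum_fd_get() and add the fd to its own
         * poll loop; here we just block and let callbacks fire. */
        quorum_dispatch(handle, CS_DISPATCH_BLOCKING);
        quorum_finalize(handle);
        return 0;
}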
>>>>
>>>> We have also added our own servant for watching the health of our
>>>> control plane, but that is not relevant to this bug (it reproduces with
>>>> that watcher turned off too).
>>>>
>>>>>
>>>>>> I was also able to get a corosync blackbox from one of the stuck VMs
>>>>>> that showed something interesting:
>>>>>> https://clbin.com/d76Ha
>>>>>>
>>>>>> It is looping on:
>>>>>> debug Feb 19 16:37:24 mcast_sendmsg(408):12: sendmsg(mcast) failed
>>>>>> (non-critical): Resource temporarily unavailable (11)
>>>>>
>>>>> Hmm ... something like tx-queue of the device full, or no buffers
>>>>> available anymore and kernel-thread doing the cleanup isn't
>>>>> scheduled ...
>>>>
>>>> Yes that is very plausible. Perhaps it'd be nicer if corosync went back
>>>> to the epoll_wait loop when it gets too many EAGAINs from sendmsg.
>>>
>>> But this is exactly what happens. Corosync calls sendmsg for all
>>> active udpu members and then returns to the main loop -> epoll_wait.
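(Illustration of the pattern being described here, not corosync's actual
code: a non-blocking sendmsg() where a full socket buffer / device tx-queue
surfaces as EAGAIN, is logged as non-critical, and control simply returns to
the epoll_wait loop; the destination address/port below are arbitrary.)

/* Non-blocking UDP send: EAGAIN means buffers are full, so log it as
 * non-critical, drop the datagram and let the caller return to its
 * poll/epoll loop instead of spinning on retries. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

static ssize_t try_send(int fd, struct msghdr *msg)
{
        ssize_t res = sendmsg(fd, msg, MSG_DONTWAIT | MSG_NOSIGNAL);

        if (res < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
                fprintf(stderr, "sendmsg failed (non-critical): %s\n",
                        strerror(errno));
        return res;
}

int main(void)
{
        char buf[] = "test";
        struct sockaddr_in dst = { .sin_family = AF_INET,
                                   .sin_port = htons(5405) };
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
        struct msghdr msg = { .msg_name = &dst, .msg_namelen = sizeof(dst),
                              .msg_iov = &iov, .msg_iovlen = 1 };
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        if (fd < 0)
                return 1;
        dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        try_send(fd, &msg);
        close(fd);
        return 0;
}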
>>>
>>>> (although this seems different from the original bug where it got stuck
>>>> in epoll_wait)
>>>
>>> I'm pretty sure it is.
>>>
>>> Anyway, let's try the "sched_yield" idea. Could you please try the
>>> included patch and see if it makes any difference (only for udpu)?
>>
>> Thanks for the patch; unfortunately corosync still spins at 106% even
>> with the yield:
>> https://clbin.com/CF64x
>
> Yep, it was kind of expected, but at least worth a try. How does strace
> look when this happens?
>
strace for the situation described below (corosync 95%, 1 vCPU):
https://clbin.com/hZL5z
> Also Klaus had an idea: try removing sbd from the picture and try a
> different RR process to find out what happens. And I think it's again
> worth a try.
>
> Could you please try install/enable/start
> https://github.com/jfriesse/spausedd (packages built by copr are
> https://copr.fedorainfracloud.org/coprs/honzaf/spausedd/),
> disable/remove sbd and run your test?
>
Tried this with 4 vCPUs but it didn't detect any pauses (kind of
expected: there are plenty of other CPUs for spausedd to be scheduled on
even if corosync hogs one).
Then tried with 1 vCPU and it didn't detect any pauses here either. The
kernel's 95% max realtime runtime protection kicks in and limits corosync
to 95% CPU, since with only one CPU the global 95% budget is that CPU's 95%.
Interestingly, corosync stays running at 95% CPU usage now, unlike in the
SMP scenarios, where the 95% limit was enough to avoid the situation.
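(For reference, that 95% cap is the kernel's global realtime throttling
budget: kernel.sched_rt_runtime_us out of kernel.sched_rt_period_us, which
default to 950000/1000000. A minimal standalone check of the current values,
purely illustrative and not part of any of the tools above:)

/* Print the RT throttling budget: SCHED_RR/SCHED_FIFO tasks may use at most
 * sched_rt_runtime_us out of every sched_rt_period_us (95% by default). */
#include <stdio.h>

static long read_long(const char *path)
{
        long v = -1;
        FILE *f = fopen(path, "r");

        if (f != NULL) {
                if (fscanf(f, "%ld", &v) != 1)
                        v = -1;
                fclose(f);
        }
        return v;
}

int main(void)
{
        long runtime = read_long("/proc/sys/kernel/sched_rt_runtime_us");
        long period = read_long("/proc/sys/kernel/sched_rt_period_us");

        if (runtime < 0 || period <= 0)
                return 1;
        printf("RT budget: %ld us of every %ld us (%.1f%%)\n",
               runtime, period, 100.0 * runtime / period);
        return 0;
}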
There is a visible keyboard lag over SSH, but spausedd does get scheduled:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 4923 root      rt   0  213344 111036  84732 R  93.3  2.8  19:07.50 corosync
    9 root      20   0       0      0      0 S   0.3  0.0   0:02.02 ksoftirqd/0
 4256 root      rt   0   88804  35612  34368 R   0.3  0.9   0:04.36 spausedd
No SBD running, and corosync-blackbox does not work at all now.
ifconfig eth0; sleep 1; ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.62.98.26  netmask 255.255.240.0  broadcast 10.62.111.255
        inet6 fd06:7768:b9e5:8c50:4c0e:daff:fe14:55f0  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::4c0e:daff:fe14:55f0  prefixlen 64  scopeid 0x20<link>
        ether 4e:0e:da:14:55:f0  txqueuelen 1000  (Ethernet)
        RX packets 4929031  bytes 4644924223 (4.3 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 15445220  bytes 22485760076 (20.9 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.62.98.26  netmask 255.255.240.0  broadcast 10.62.111.255
        inet6 fd06:7768:b9e5:8c50:4c0e:daff:fe14:55f0  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::4c0e:daff:fe14:55f0  prefixlen 64  scopeid 0x20<link>
        ether 4e:0e:da:14:55:f0  txqueuelen 1000  (Ethernet)
        RX packets 4934042  bytes 4649642415 (4.3 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 15464987  bytes 22514562910 (20.9 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
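(Diffing the two samples: the guest-side counters advanced by roughly 5,000
RX and 19,800 TX packets, about 28.8 MB transmitted, over the one-second
interval.)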
Looking at this from the hypervisor side, the VM is not sending many
packets once CPU usage goes up:
https://pasteboard.co/I246a1W.png
>>
>> On another host corosync failed to start up completely (Denied
>> connection not ready), and:
>> https://clbin.com/Z35Gl
>> (I don't think this is related to the patch, it was doing that before
>> when I looked at it this morning, kernel 4.20.0 this time)
>
> This one looks kind of normal and I'm pretty sure it's unrelated (I've
> seen it before, but sadly was never able to find a "reliable" reproducer).
Got a fairly reliable one now: only 3/16 VMs were able to start
corosync; one got stuck with high CPU usage as shown above, and the others
failed to start with 'Denied connection'.
In the past we've seen this kind of problem when dhclient happened to
grab and listen on corosync's port, but that's been fixed in recent
CentOS 7.x releases, and is not the case now.
Anything you'd like me to try to help debug this?
Thanks,
--Edwin