[ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?
Jan Friesse
jfriesse at redhat.com
Thu Feb 21 08:03:21 UTC 2019
Edwin,
> On 20/02/2019 13:08, Jan Friesse wrote:
>> Edwin Török wrote:
>>> On 20/02/2019 07:57, Jan Friesse wrote:
>>>> Edwin,
>>>>>
>>>>>
>>>>> On 19/02/2019 17:02, Klaus Wenninger wrote:
>>>>>> On 02/19/2019 05:41 PM, Edwin Török wrote:
>>>>>>> On 19/02/2019 16:26, Edwin Török wrote:
>>>>>>>> On 18/02/2019 18:27, Edwin Török wrote:
>>>>>>>>> Did a test today with CentOS 7.6 with upstream kernel and with
>>>>>>>>> 4.20.10-1.el7.elrepo.x86_64 (tested both with upstream SBD, and our
>>>>>>>>> patched [1] SBD) and was not able to reproduce the issue yet.
>>>>>>>> I was finally able to reproduce this using only upstream components
>>>>>>>> (it seems to be easier to reproduce with our patched SBD, but this
>>>>>>>> time I reproduced it using only upstream packages, unpatched by us):
>>>>>>
>>>>>> Just out of curiosity: What did you patch in SBD?
>>>>>> Sorry if I missed the answer in the previous communication.
>>>>>
>>>>> It is mostly this PR, which calls getquorate quite often (a more
>>>>> efficient impl. would be to use the quorum notification API like
>>>>> dlm/pacemaker do, although see concerns in
>>>>> https://lists.clusterlabs.org/pipermail/users/2019-February/016249.html):
>>>>>
>>>>> https://github.com/ClusterLabs/sbd/pull/27
>>>>>
>>>>> We have also added our own servant for watching the health of our
>>>>> control plane, but that is not relevant to this bug (it reproduces with
>>>>> that watcher turned off too).
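For readers following along, a rough sketch of the notification-based
approach mentioned above (assuming the corosync 2.x libquorum API from
corosync/quorum.h; exact names and signatures should be checked against the
headers actually installed, and the program links against libquorum):

#include <stdio.h>
#include <stdint.h>
#include <corosync/corotypes.h>
#include <corosync/quorum.h>

/* Invoked by quorum_dispatch() whenever the quorum state changes,
 * instead of polling quorum_getquorate() in a tight loop. */
static void quorum_notify(quorum_handle_t handle, uint32_t quorate,
                          uint64_t ring_seq, uint32_t view_list_entries,
                          uint32_t *view_list)
{
    printf("quorate=%u, members in view=%u\n", quorate, view_list_entries);
}

int main(void)
{
    quorum_handle_t handle;
    uint32_t quorum_type;
    quorum_callbacks_t callbacks = { .quorum_notify_fn = quorum_notify };

    if (quorum_initialize(&handle, &callbacks, &quorum_type) != CS_OK)
        return 1;

    /* Ask for change notifications plus one callback with the current state. */
    quorum_trackstart(handle, CS_TRACK_CHANGES | CS_TRACK_CURRENT);

    /* A real daemon would fetch the fd via quorum_fd_get(), add it to its
     * own poll loop and call quorum_dispatch(handle, CS_DISPATCH_ONE) when
     * it becomes readable; blocking here keeps the sketch short. */
    quorum_dispatch(handle, CS_DISPATCH_BLOCKING);

    quorum_finalize(handle);
    return 0;
}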
>>>>>
>>>>>>
>>>>>>> I was also able to get a corosync blackbox from one of the stuck VMs
>>>>>>> that showed something interesting:
>>>>>>> https://clbin.com/d76Ha
>>>>>>>
>>>>>>> It is looping on:
>>>>>>> debug Feb 19 16:37:24 mcast_sendmsg(408):12: sendmsg(mcast) failed
>>>>>>> (non-critical): Resource temporarily unavailable (11)
>>>>>>
>>>>>> Hmm ... something like tx-queue of the device full, or no buffers
>>>>>> available anymore and kernel-thread doing the cleanup isn't
>>>>>> scheduled ...
>>>>>
>>>>> Yes that is very plausible. Perhaps it'd be nicer if corosync went back
>>>>> to the epoll_wait loop when it gets too many EAGAINs from sendmsg.
>>>>
>>>> But this is exactly what happens. Corosync will call sendmsg to all
>>>> active udpu members and then return to the main loop -> epoll_wait.
>>>>
>>>>> (although this seems different from the original bug where it got stuck
>>>>> in epoll_wait)
>>>>
>>>> I'm pretty sure it is.
>>>>
>>>> Anyway, let's try "sched_yield" idea. Could you please try included
>>>> patch and see if it makes any difference (only for udpu)?
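To clarify what the patch tries, the idea on the udpu send path is roughly
the sketch below (illustrative only, not the actual totemudpu.c code; the
helper name is made up):

#include <errno.h>
#include <sched.h>
#include <sys/socket.h>

/* Sketch of the non-blocking udpu send path being discussed: on EAGAIN the
 * datagram is dropped and logged as "non-critical", and control returns to
 * the main epoll loop. The experimental patch additionally calls
 * sched_yield() here, so that a SCHED_RR corosync gives the kernel a chance
 * to drain the interface tx queue before the next send. */
static ssize_t udpu_try_sendmsg(int fd, struct msghdr *msg)
{
    ssize_t res = sendmsg(fd, msg, MSG_NOSIGNAL | MSG_DONTWAIT);

    if (res < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        /* tx queue full / no buffers available: non-critical, yield */
        sched_yield();
    }
    return res;
}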
>>>
>>> Thanks for the patch; unfortunately corosync still spins at 106% CPU even
>>> with yield:
>>> https://clbin.com/CF64x
>>
>> Yep, it was kind of expected, but at least worth a try. How does strace
>> look when this happens?
>>
>
> strace for the situation described below (corosync 95%, 1 vCPU):
> https://clbin.com/hZL5z
This one again looks quite "normal". If there is a problem (and 95% corosync
CPU usage is a problem), it is (or at least looks like) a different one than
in the first report (which was full of epoll_wait calls immediately returning
with some fd ready).
>
>> Also Klaus had an idea to try remove sbd from the picture and try
>> different RR process to find out what happens. And I think it's again
>> worth try.
>>
>> Could you please try install/enable/start
>> https://github.com/jfriesse/spausedd (packages built by copr are
>> https://copr.fedorainfracloud.org/coprs/honzaf/spausedd/),
>> disable/remove sbd and run your test?
>>
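For reference, the steps I had in mind were roughly the following
(illustrative; this assumes the copr package ships a spausedd.service unit,
so adjust to however sbd is managed on your nodes):

# install spausedd from the copr repo linked above (e.g. via the yum copr
# plugin, or by dropping the .repo file from the copr page into
# /etc/yum.repos.d/), then:
yum install spausedd

# take sbd out of the picture and run the SCHED_RR watchdog daemon instead
systemctl stop sbd
systemctl disable sbd
systemctl enable spausedd
systemctl start spausedd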
>
> Tried this with 4 vCPUs but it didn't detect any pauses (kind of
> expected, plenty of other CPUs to have spausedd scheduled on even if
> corosync hogs one).
> Then tried with 1 vCPU and it didn't detect any pauses here either. The
> 95% max realtime runtime protection kicks in and limits corosync to 95%
> CPU, since with only one CPU the global 95% is that CPU's 95%.
> Interestingly corosync stays running at 95% CPU usage now, unlike in SMP
> scenarios where the 95% limit was enough to avoid the situation.
Kind of expected. The reason for this test was to get sbd out of the
picture - I'm not convinced we were successful, because it looks like we
are solving a different problem than before. spausedd is (mainly) for
detecting an overloaded VM host.
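
For reference, the 95% cap you are hitting should be the kernel's realtime
throttling; the usual defaults (shown here, your values may differ) are:

$ sysctl kernel.sched_rt_period_us kernel.sched_rt_runtime_us
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000

i.e. SCHED_RR/SCHED_FIFO tasks may use at most 950 ms of every 1000 ms period.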
> There is a visible keyboard lag over SSH, but spausedd does get scheduled:
>
>   PID USER  PR  NI    VIRT    RES   SHR S %CPU %MEM     TIME+ COMMAND
>  4923 root  rt   0  213344 111036 84732 R 93.3  2.8  19:07.50 corosync
>     9 root  20   0       0      0     0 S  0.3  0.0   0:02.02 ksoftirqd/0
>  4256 root  rt   0   88804  35612 34368 R  0.3  0.9   0:04.36 spausedd
>
> No SBD running, and corosync-blackbox does not work at all now.
>
That's bad
> ifconfig eth0; sleep 1; ifconfig eth0
> eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
> inet 10.62.98.26 netmask 255.255.240.0 broadcast 10.62.111.255
> inet6 fd06:7768:b9e5:8c50:4c0e:daff:fe14:55f0 prefixlen 64
> scopeid 0x0<global>
> inet6 fe80::4c0e:daff:fe14:55f0 prefixlen 64 scopeid 0x20<link>
> ether 4e:0e:da:14:55:f0 txqueuelen 1000 (Ethernet)
> RX packets 4929031 bytes 4644924223 (4.3 GiB)
> RX errors 0 dropped 0 overruns 0 frame 0
> TX packets 15445220 bytes 22485760076 (20.9 GiB)
> TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
>
> eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
> inet 10.62.98.26 netmask 255.255.240.0 broadcast 10.62.111.255
> inet6 fd06:7768:b9e5:8c50:4c0e:daff:fe14:55f0 prefixlen 64
> scopeid 0x0<global>
> inet6 fe80::4c0e:daff:fe14:55f0 prefixlen 64 scopeid 0x20<link>
> ether 4e:0e:da:14:55:f0 txqueuelen 1000 (Ethernet)
> RX packets 4934042 bytes 4649642415 (4.3 GiB)
> RX errors 0 dropped 0 overruns 0 frame 0
> TX packets 15464987 bytes 22514562910 (20.9 GiB)
> TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
>
> Looking at this from the hypervisor side the VM is not sending many
> packets once CPU usage goes up:
> https://pasteboard.co/I246a1W.png
>
>>>
>>> On another host corosync failed to start up completely (Denied
>>> connection not ready), and:
>>> https://clbin.com/Z35Gl
>>> (I don't think this is related to the patch, it was doing that before
>>> when I looked at it this morning, kernel 4.20.0 this time)
>>
>> This one looks kind of normal and I'm pretty sure it's unrelated (I've
>> seen it already; sadly I was never able to find a "reliable" reproducer).
>
> Got a fairly reliable one now: only 3/16 VMs were able to start
> corosync, one got stuck with CPU usage as shown above, and the others
> failed to start with 'Denied connection'.
> In the past we've seen this kind of problem when dhclient happened to
> grab and listen on corosync's port, but that's been fixed in recent
> CentOS 7.x releases and is not the case now.
>
> Anything you'd like me to try help debug this?
Just to summarize how I understand the current situation:
- The behavior in the initial report was very different from what you see now.
- It looks like the biggest problem currently is the inability to form membership.
For now, I would suggest trying:
- corosync without the sched_yield patch
- setting send_join in corosync.conf to 200. It was reported that this can
help with forming bigger clusters (see the fragment below).
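
A minimal illustrative corosync.conf fragment (only the send_join line is
the actual suggestion; version/transport just stand in for whatever your
totem section already contains):

totem {
    version: 2
    transport: udpu
    # wait a random time of up to 200 ms before sending join messages;
    # reported to help when forming membership in bigger clusters
    send_join: 200
}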
Honza
>
> Thanks,
> --Edwin
>