[ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

Klaus Wenninger kwenning at redhat.com
Wed Feb 20 15:25:09 EST 2019


On 02/20/2019 06:37 PM, Edwin Török wrote:
> On 20/02/2019 13:08, Jan Friesse wrote:
>>> Edwin Török wrote:
>>> On 20/02/2019 07:57, Jan Friesse wrote:
>>>> Edwin,
>>>>>
>>>>> On 19/02/2019 17:02, Klaus Wenninger wrote:
>>>>>> On 02/19/2019 05:41 PM, Edwin Török wrote:
>>>>>>> On 19/02/2019 16:26, Edwin Török wrote:
>>>>>>>> On 18/02/2019 18:27, Edwin Török wrote:
>>>>>>>>> Did a test today with CentOS 7.6 and upstream kernel
>>>>>>>>> 4.20.10-1.el7.elrepo.x86_64 (tested both with upstream SBD and our
>>>>>>>>> patched [1] SBD) and was not able to reproduce the issue yet.
>>>>>>>> I was finally able to reproduce this using only upstream components
>>>>>>>> (it seems easier to reproduce with our patched SBD, but I could also
>>>>>>>> reproduce it using only upstream packages, unpatched by us):
>>>>>> Just out of curiosity: What did you patch in SBD?
>>>>>> Sorry if I missed the answer in the previous communication.
>>>>> It is mostly this PR, which calls getquorate quite often (a more
>>>>> efficient impl. would be to use the quorum notification API like
>>>>> dlm/pacemaker do, although see concerns in
>>>>> https://lists.clusterlabs.org/pipermail/users/2019-February/016249.html):
>>>>>
>>>>> https://github.com/ClusterLabs/sbd/pull/27
>>>>>
>>>>> We have also added our own servant for watching the health of our
>>>>> control plane, but that is not relevant to this bug (it reproduces with
>>>>> that watcher turned off too).
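For anyone following along, the difference between polling and the
notification API in libquorum terms looks roughly like this: a minimal
sketch assuming the corosync 2.x quorum.h signatures, with error handling
trimmed; the PR above is the authoritative version of what sbd actually does.

/* Sketch of the two approaches against corosync's libquorum
 * (link with -lquorum); not the sbd code from the PR. */
#include <stdio.h>
#include <corosync/corotypes.h>
#include <corosync/quorum.h>

/* Polling: ask for the quorate flag on every servant tick. */
static int poll_quorate(void)
{
        quorum_handle_t h;
        uint32_t quorum_type;
        int quorate = -1;

        if (quorum_initialize(&h, NULL, &quorum_type) != CS_OK)
                return -1;
        if (quorum_getquorate(h, &quorate) != CS_OK)
                quorate = -1;
        quorum_finalize(h);
        return quorate;
}

/* Notification: register once and let corosync push changes. */
static void quorum_notify(quorum_handle_t h, uint32_t quorate,
                          uint64_t ring_seq, uint32_t n_members,
                          uint32_t *member_list)
{
        (void)h; (void)ring_seq; (void)n_members; (void)member_list;
        printf("quorate is now %u\n", quorate);
}

static int track_quorate(quorum_handle_t *h)
{
        quorum_callbacks_t cb = { .quorum_notify_fn = quorum_notify };
        uint32_t quorum_type;

        if (quorum_initialize(h, &cb, &quorum_type) != CS_OK)
                return -1;
        /* From here on, poll the fd from quorum_fd_get() and call
         * quorum_dispatch() to get quorum_notify() on every change. */
        return (quorum_trackstart(*h, CS_TRACK_CHANGES) == CS_OK) ? 0 : -1;
}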
>>>>>
>>>>>>> I was also able to get a corosync blackbox from one of the stuck VMs
>>>>>>> that showed something interesting:
>>>>>>> https://clbin.com/d76Ha
>>>>>>>
>>>>>>> It is looping on:
>>>>>>> debug   Feb 19 16:37:24 mcast_sendmsg(408):12: sendmsg(mcast) failed
>>>>>>> (non-critical): Resource temporarily unavailable (11)
>>>>>> Hmm ... something like tx-queue of the device full, or no buffers
>>>>>> available anymore and kernel-thread doing the cleanup isn't
>>>>>> scheduled ...
>>>>> Yes that is very plausible. Perhaps it'd be nicer if corosync went back
>>>>> to the epoll_wait loop when it gets too many EAGAINs from sendmsg.
>>>> But this is exactly what happens. Corosync calls sendmsg to all
>>>> active udpu members and then returns to the main loop -> epoll_wait.
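To spell out the pattern being described here (a sketch of the general
behaviour only, not the actual totemudpu code): on a non-blocking socket,
EAGAIN from sendmsg() just means the local tx queue or socket buffer is
full, so the send is logged as non-critical and control falls back to the
epoll_wait() loop:

/* Generic pattern for a non-blocking UDP sender driven by epoll;
 * a sketch of the behaviour described above, not corosync's code. */
#include <errno.h>
#include <sys/socket.h>

static ssize_t try_sendmsg(int fd, const struct msghdr *msg)
{
        ssize_t res = sendmsg(fd, msg, MSG_NOSIGNAL | MSG_DONTWAIT);

        if (res < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
                /* tx queue / socket buffer full: treat as non-critical,
                 * drop the datagram (totem retransmits anyway) and fall
                 * straight back to the epoll_wait() main loop so the
                 * kernel gets a chance to drain the queue. */
                return 0;
        }
        return res;
}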
>>>>
>>>>> (although this seems different from the original bug where it got stuck
>>>>> in epoll_wait)
>>>> I'm pretty sure it is.
>>>>
>>>> Anyway, let's try the "sched_yield" idea. Could you please try the
>>>> included patch and see if it makes any difference (only for udpu)?
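Roughly what the sched_yield idea amounts to, as a sketch (the struct and
function names below are made up for illustration; the attached patch is
the real thing and is not reproduced here):

/* Sketch of the idea only: after the burst of sendmsg() calls to every
 * udpu member, voluntarily move to the back of the SCHED_RR run queue
 * for our priority so that another runnable task of the *same* RT
 * priority gets a turn. */
#include <sched.h>
#include <stddef.h>
#include <sys/socket.h>

struct udpu_member { int fd; };  /* hypothetical stand-in for a peer */

static void mcast_to_all_members(struct udpu_member *members, size_t n,
                                 const struct msghdr *msg)
{
        for (size_t i = 0; i < n; i++)
                (void)sendmsg(members[i].fd, msg, MSG_DONTWAIT);

        sched_yield();
}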
>>> Thanks for the patch; unfortunately corosync still spins at 106% even
>>> with the yield:
>>> https://clbin.com/CF64x
>> Yep, it was kind of expected, but at least worth a try. How does strace
>> look when this happens?
>>
> strace for the situation described below (corosync 95%, 1 vCPU):
> https://clbin.com/hZL5z
>
>> Also Klaus had an idea: remove sbd from the picture and try a
>> different RR process to find out what happens. I think that's again
>> worth a try.
>>
>> Could you please install/enable/start
>> https://github.com/jfriesse/spausedd (packages built by copr are at
>> https://copr.fedorainfracloud.org/coprs/honzaf/spausedd/),
>> disable/remove sbd and run your test?
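For anyone who hasn't seen it: what such a pause detector boils down to is
roughly the loop below (a simplified sketch, not the actual spausedd code;
I believe the real tool also accounts for steal time):

/* Simplified sketch of a scheduling-pause detector: sleep for a fixed
 * interval and complain when the wakeup arrives much later than
 * requested, i.e. when this (RT) process was not scheduled in time. */
#include <stdio.h>
#include <time.h>

#define INTERVAL_MS   200
#define THRESHOLD_MS  (INTERVAL_MS * 8 / 10)  /* warn above 80% overrun */

static long long now_ms(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000LL + ts.tv_nsec / 1000000LL;
}

int main(void)
{
        struct timespec req = { 0, INTERVAL_MS * 1000000L };

        for (;;) {
                long long before = now_ms();

                nanosleep(&req, NULL);

                long long late = now_ms() - before - INTERVAL_MS;
                if (late > THRESHOLD_MS)
                        fprintf(stderr,
                                "not scheduled for %lld ms longer than expected\n",
                                late);
        }
}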
>>
> Tried this with 4 vCPUs but it didn't detect any pauses (kind of
> expected: plenty of other CPUs for spausedd to be scheduled on even if
> corosync hogs one).
> Then tried with 1 vCPU and it didn't detect any pauses here either. The
> 95% max realtime runtime protection kicks in and limits corosync to 95%
> CPU, since with only one CPU the global 95% is that CPU's 95%.
> Interestingly corosync stays running at 95% CPU usage now, unlike in SMP
> scenarios where the 95% limit was enough to avoid the situation.

Hmm, maybe the thing that should be scheduled is running at
SCHED_RR as well, just with a lower prio. Then it wouldn't
profit from the sched_yield, and it wouldn't get any of
the 5% either.
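One quick way to check that hypothesis is to look at the policy and RT
priority of the task that appears starved ('chrt -p <pid>' from util-linux
shows the same; the snippet below is just a sketch). With the throttle
sysctls at their usual kernel.sched_rt_runtime_us=950000 /
kernel.sched_rt_period_us=1000000, the leftover 5% goes to SCHED_OTHER
tasks only, and sched_yield() only rotates the run queue of the caller's
own priority, so a lower-prio SCHED_RR task would indeed get neither.

/* Sketch: report the scheduling policy and RT priority of a pid, to
 * see whether the starved task is itself SCHED_RR at a lower prio. */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
        pid_t pid = (argc > 1) ? (pid_t)atoi(argv[1]) : 0;
        struct sched_param sp;
        int policy = sched_getscheduler(pid);

        if (policy < 0 || sched_getparam(pid, &sp) < 0) {
                perror("sched_get*");
                return 1;
        }
        printf("pid %d: policy %s, rt priority %d\n", (int)pid,
               policy == SCHED_RR    ? "SCHED_RR" :
               policy == SCHED_FIFO  ? "SCHED_FIFO" :
               policy == SCHED_OTHER ? "SCHED_OTHER" : "other",
               sp.sched_priority);
        return 0;
}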

> There is a visible keyboard lag over SSH, but spausedd does get scheduled:

But it at least shows that the sched_yield is doing something ...
although the RR might kick in even without the explicit
sched_yield in cases like this.

>
>   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
>  4923 root      rt   0  213344 111036  84732 R 93.3  2.8  19:07.50 corosync
>     9 root      20   0       0      0      0 S  0.3  0.0   0:02.02 ksoftirqd/0
>  4256 root      rt   0   88804  35612  34368 R  0.3  0.9   0:04.36 spausedd
>
> No SBD running, and corosync-blackbox does not work at all now.
>
> ifconfig eth0; sleep 1; ifconfig eth0
> eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>         inet 10.62.98.26  netmask 255.255.240.0  broadcast 10.62.111.255
>         inet6 fd06:7768:b9e5:8c50:4c0e:daff:fe14:55f0  prefixlen 64  scopeid 0x0<global>
>         inet6 fe80::4c0e:daff:fe14:55f0  prefixlen 64  scopeid 0x20<link>
>         ether 4e:0e:da:14:55:f0  txqueuelen 1000  (Ethernet)
>         RX packets 4929031  bytes 4644924223 (4.3 GiB)
>         RX errors 0  dropped 0  overruns 0  frame 0
>         TX packets 15445220  bytes 22485760076 (20.9 GiB)
>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>
> eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>         inet 10.62.98.26  netmask 255.255.240.0  broadcast 10.62.111.255
>         inet6 fd06:7768:b9e5:8c50:4c0e:daff:fe14:55f0  prefixlen 64  scopeid 0x0<global>
>         inet6 fe80::4c0e:daff:fe14:55f0  prefixlen 64  scopeid 0x20<link>
>         ether 4e:0e:da:14:55:f0  txqueuelen 1000  (Ethernet)
>         RX packets 4934042  bytes 4649642415 (4.3 GiB)
>         RX errors 0  dropped 0  overruns 0  frame 0
>         TX packets 15464987  bytes 22514562910 (20.9 GiB)
>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>
> Looking at this from the hypervisor side, the VM is not sending many
> packets once CPU usage goes up:
> https://pasteboard.co/I246a1W.png
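For reference, the counter deltas between the two snapshots above (taken
one second apart) work out to 4934042 - 4929031 = 5011 RX packets and
15464987 - 15445220 = 19767 TX packets.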
>
>>> On another host, corosync failed to start up completely (Denied
>>> connection not ready), and:
>>> https://clbin.com/Z35Gl
>>> (I don't think this is related to the patch; it was already doing that
>>> when I looked at it this morning; kernel 4.20.0 this time)
>> This one looks kind of normal and I'm pretty sure it's unrelated (I've
>> seen it already; sadly, I was never able to find a "reliable" reproducer).
> Got a fairly reliable one now: only 3/16 VMs were able to start
> corosync; one got stuck with CPU usage as shown above, and the others
> failed to start with 'Denied connection'.
>
> In the past we've seen this kind of problem when dhclient happened to
> grab and listen on corosync's port, but that's been fixed in recent
> CentOS 7.x releases and is not the case now.
>
> Anything you'd like me to try to help debug this?
>
> Thanks,
> --Edwin



