[ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

Edwin Török edvin.torok at citrix.com
Wed Feb 20 12:37:19 EST 2019


On 20/02/2019 13:08, Jan Friesse wrote:
> Edwin Török wrote:
>> On 20/02/2019 07:57, Jan Friesse wrote:
>>> Edwin,
>>>>
>>>>
>>>> On 19/02/2019 17:02, Klaus Wenninger wrote:
>>>>> On 02/19/2019 05:41 PM, Edwin Török wrote:
>>>>>> On 19/02/2019 16:26, Edwin Török wrote:
>>>>>>> On 18/02/2019 18:27, Edwin Török wrote:
>>>>>>>> Did a test today with CentOS 7.6 with an upstream kernel
>>>>>>>> (4.20.10-1.el7.elrepo.x86_64), tested both with upstream SBD and with
>>>>>>>> our patched [1] SBD, and was not able to reproduce the issue yet.
>>>>>>> I was finally able to reproduce this using only upstream components
>>>>>>> (although it seems to be easier to reproduce with our patched SBD, I
>>>>>>> was able to reproduce it using only upstream packages, unpatched by
>>>>>>> us):
>>>>>
>>>>> Just out of curiosity: What did you patch in SBD?
>>>>> Sorry if I missed the answer in the previous communication.
>>>>
>>>> It is mostly this PR, which calls getquorate quite often (a more
>>>> efficient impl. would be to use the quorum notification API like
>>>> dlm/pacemaker do, although see concerns in
>>>> https://lists.clusterlabs.org/pipermail/users/2019-February/016249.html):
>>>>
>>>> https://github.com/ClusterLabs/sbd/pull/27
>>>>
>>>> We have also added our own servant for watching the health of our
>>>> control plane, but that is not relevant to this bug (it reproduces with
>>>> that watcher turned off too).
>>>>
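(As an aside, the notification-based variant mentioned above would look
roughly like the sketch below. This is against the corosync 2.x quorum API
as I understand it, not the code from the PR; names and error handling are
simplified.)

/* Sketch: register for quorum change notifications instead of polling
 * quorum_getquorate() on a timer.  Link with -lquorum. */
#include <stdio.h>
#include <stdint.h>
#include <corosync/corotypes.h>
#include <corosync/quorum.h>

static void quorum_notify(quorum_handle_t handle, uint32_t quorate,
                          uint64_t ring_seq, uint32_t view_list_entries,
                          uint32_t *view_list)
{
        /* Runs only when membership/quorum actually changes. */
        printf("quorate=%u ring_seq=%llu members=%u\n",
               quorate, (unsigned long long)ring_seq, view_list_entries);
}

int main(void)
{
        quorum_handle_t handle;
        uint32_t quorum_type;
        quorum_callbacks_t callbacks = { .quorum_notify_fn = quorum_notify };

        if (quorum_initialize(&handle, &callbacks, &quorum_type) != CS_OK)
                return 1;
        if (quorum_trackstart(handle, CS_TRACK_CHANGES) != CS_OK)
                return 1;

        /* Block here; the callback fires on changes instead of us polling. */
        quorum_dispatch(handle, CS_DISPATCH_BLOCKING);

        quorum_finalize(handle);
        return 0;
}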
>>>>>
>>>>>> I was also able to get a corosync blackbox from one of the stuck VMs
>>>>>> that showed something interesting:
>>>>>> https://clbin.com/d76Ha
>>>>>>
>>>>>> It is looping on:
>>>>>> debug   Feb 19 16:37:24 mcast_sendmsg(408):12: sendmsg(mcast) failed
>>>>>> (non-critical): Resource temporarily unavailable (11)
>>>>>
>>>>> Hmm ... something like tx-queue of the device full, or no buffers
>>>>> available anymore and kernel-thread doing the cleanup isn't
>>>>> scheduled ...
>>>>
>>>> Yes that is very plausible. Perhaps it'd be nicer if corosync went back
>>>> to the epoll_wait loop when it gets too many EAGAINs from sendmsg.
>>>
>>> But this is exactly what happens. Corosync calls sendmsg for all
>>> active udpu members and returns back to the main loop -> epoll_wait.
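(For readers following along, the shape being described is roughly the
sketch below: one non-blocking sendmsg() per udpu member, EAGAIN logged as
non-critical, then straight back to the epoll_wait() loop. This is my
simplification, not corosync's actual totemudpu code.)

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

struct member {
        struct sockaddr_storage addr;
        socklen_t addrlen;
};

static void mcast_send_to_members(int fd, void *buf, size_t len,
                                  struct member *members, size_t n_members)
{
        for (size_t i = 0; i < n_members; i++) {
                struct iovec iov = { .iov_base = buf, .iov_len = len };
                struct msghdr msg;

                memset(&msg, 0, sizeof(msg));
                msg.msg_name = &members[i].addr;
                msg.msg_namelen = members[i].addrlen;
                msg.msg_iov = &iov;
                msg.msg_iovlen = 1;

                if (sendmsg(fd, &msg, MSG_DONTWAIT) < 0 && errno == EAGAIN) {
                        /* Datagram is dropped here; the totem protocol's
                         * retransmission has to cover it, and the caller
                         * simply returns to epoll_wait(). */
                        fprintf(stderr,
                                "sendmsg(mcast) failed (non-critical)\n");
                }
        }
}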
>>>
>>>> (although this seems different from the original bug where it got stuck
>>>> in epoll_wait)
>>>
>>> I'm pretty sure it is.
>>>
>>> Anyway, let's try the "sched_yield" idea. Could you please try the
>>> included patch and see if it makes any difference (only for udpu)?
>>
>> Thanks for the patch; unfortunately corosync still spins at 106% even
>> with the yield:
>> https://clbin.com/CF64x
> 
> Yep, it was kind of expected, but at least worth a try. How does strace
> look when this happens?
> 

strace for the situation described below (corosync 95%, 1 vCPU):
https://clbin.com/hZL5z

> Also Klaus had an idea: remove sbd from the picture and try a
> different RR process to find out what happens. And I think it's again
> worth a try.
> 
> Could you please install/enable/start
> https://github.com/jfriesse/spausedd (packages built by copr are at
> https://copr.fedorainfracloud.org/coprs/honzaf/spausedd/),
> disable/remove sbd, and run your test?
> 
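(For context, my understanding is that spausedd's core check is simply
"sleep a fixed interval and complain if the wall clock says we overslept
badly". A rough sketch of that idea follows; this is not spausedd's actual
code, and the interval/threshold values are made up.)

#include <stdio.h>
#include <time.h>

#define INTERVAL_MS   200
#define THRESHOLD_MS  400   /* complain if we overslept past this */

static double now_ms(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

int main(void)
{
        /* (Raising its own scheduling priority, as I believe spausedd can,
         * is omitted in this sketch.) */
        struct timespec req = { 0, INTERVAL_MS * 1000000L };

        for (;;) {
                double before = now_ms();

                nanosleep(&req, NULL);

                double elapsed = now_ms() - before;
                if (elapsed > THRESHOLD_MS)
                        fprintf(stderr,
                                "not scheduled for %.0f ms (expected ~%d ms)\n",
                                elapsed, INTERVAL_MS);
        }
}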

Tried this with 4 vCPUs, but it didn't detect any pauses (kind of
expected: there are plenty of other CPUs for spausedd to be scheduled on
even if corosync hogs one).
Then tried with 1 vCPU, and it didn't detect any pauses here either. The
95% max realtime runtime protection kicks in and limits corosync to 95%
CPU, since with only one CPU the global 95% is this CPU's 95%.
Interestingly, corosync now stays running at 95% CPU usage, unlike in the
SMP scenarios, where the 95% limit was enough to avoid the situation.
There is visible keyboard lag over SSH, but spausedd does get scheduled:

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 4923 root      rt   0  213344 111036  84732 R 93.3  2.8  19:07.50 corosync
    9 root      20   0       0      0      0 S  0.3  0.0   0:02.02 ksoftirqd/0
 4256 root      rt   0   88804  35612  34368 R  0.3  0.9   0:04.36 spausedd
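(The 95% ceiling above is just the kernel's default RT throttling ratio,
sched_rt_runtime_us/sched_rt_period_us = 950000/1000000. A trivial way to
double-check those values on a box, sketched here:)

#include <stdio.h>

static long read_long(const char *path)
{
        long v = -1;
        FILE *f = fopen(path, "r");

        if (f) {
                if (fscanf(f, "%ld", &v) != 1)
                        v = -1;
                fclose(f);
        }
        return v;
}

int main(void)
{
        long runtime = read_long("/proc/sys/kernel/sched_rt_runtime_us");
        long period  = read_long("/proc/sys/kernel/sched_rt_period_us");

        if (runtime > 0 && period > 0)
                printf("RT tasks may use up to %.1f%% of each period\n",
                       100.0 * runtime / period);
        return 0;
}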

No SBD running, and corosync-blackbox does not work at all now.

ifconfig eth0; sleep 1; ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.62.98.26  netmask 255.255.240.0  broadcast 10.62.111.255
        inet6 fd06:7768:b9e5:8c50:4c0e:daff:fe14:55f0  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::4c0e:daff:fe14:55f0  prefixlen 64  scopeid 0x20<link>
        ether 4e:0e:da:14:55:f0  txqueuelen 1000  (Ethernet)
        RX packets 4929031  bytes 4644924223 (4.3 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 15445220  bytes 22485760076 (20.9 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.62.98.26  netmask 255.255.240.0  broadcast 10.62.111.255
        inet6 fd06:7768:b9e5:8c50:4c0e:daff:fe14:55f0  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::4c0e:daff:fe14:55f0  prefixlen 64  scopeid 0x20<link>
        ether 4e:0e:da:14:55:f0  txqueuelen 1000  (Ethernet)
        RX packets 4934042  bytes 4649642415 (4.3 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 15464987  bytes 22514562910 (20.9 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Looking at this from the hypervisor side, the VM is not sending many
packets once CPU usage goes up:
https://pasteboard.co/I246a1W.png

>>
>> On another host corosync failed to start up completely (Denied
>> connection not ready), and:
>> https://clbin.com/Z35Gl
>> (I don't think this is related to the patch; it was already doing that
>> when I looked at it this morning, kernel 4.20.0 this time)
> 
> This one looks kind of normal and I'm pretty sure it's unrelated (I've
> seen it already, but sadly was never able to find a "reliable" reproducer)

Got a fairly reliable one now: only 3/16 VMs were able to start
corosync; one got stuck with CPU usage as shown above, and the others
failed to start with 'Denied connection'.

In the past we've seen this kind of problem when dhclient happened to
bind to and listen on corosync's port, but that's been fixed in recent
CentOS 7.x releases and is not the case now.

Anything you'd like me to try to help debug this?

Thanks,
--Edwin


