[ClusterLabs] Antw: Re: corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

Wed Feb 20 13:17:22 UTC 2019

>>> Edwin Török <edvin.torok at citrix.com> wrote on 20.02.2019 at 12:30 in
message <0a49f593-1543-76e4-a8ab-06a48c596c23 at citrix.com>:
> On 20/02/2019 07:57, Jan Friesse wrote:
>> Edwin,
>>>
>>>
>>> On 19/02/2019 17:02, Klaus Wenninger wrote:
>>>> On 02/19/2019 05:41 PM, Edwin Török wrote:
>>>>> On 19/02/2019 16:26, Edwin Török wrote:
>>>>>> On 18/02/2019 18:27, Edwin Török wrote:
>>>>>>> Did a test today with CentOS 7.6 with upstream kernel and with
>>>>>>> 4.20.10-1.el7.elrepo.x86_64 (tested both with upstream SBD, and our
>>>>>>> patched [1] SBD) and was not able to reproduce the issue yet.
>>>>>> I was finally able to reproduce this using only upstream packages,
>>>>>> unpatched by us (although it seems to be easier to reproduce with
>>>>>> our patched SBD):
>>>>
>>>> Just out of curiosity: What did you patch in SBD?
>>>> Sorry if I missed the answer in the previous communication.
>>>
>>> It is mostly this PR, which calls getquorate quite often (a more
>>> efficient impl. would be to use the quorum notification API like
>>> dlm/pacemaker do, although see concerns in
>>> https://lists.clusterlabs.org/pipermail/users/2019-February/016249.html):
>>>
>>> https://github.com/ClusterLabs/sbd/pull/27 
>>>
>>> We have also added our own servant for watching the health of our
>>> control plane, but that is not relevant to this bug (it reproduces with
>>> that watcher turned off too).
>>>
>>>>
>>>>> I was also able to get a corosync blackbox from one of the stuck VMs
>>>>> that showed something interesting:
>>>>> https://clbin.com/d76Ha 
>>>>>
>>>>> It is looping on:
>>>>> debug   Feb 19 16:37:24 mcast_sendmsg(408):12: sendmsg(mcast) failed
>>>>> (non-critical): Resource temporarily unavailable (11)
>>>>
>>>> Hmm ... something like tx-queue of the device full, or no buffers
>>>> available anymore and kernel-thread doing the cleanup isn't
>>>> scheduled ...
>>>
>>> Yes that is very plausible. Perhaps it'd be nicer if corosync went back
>>> to the epoll_wait loop when it gets too many EAGAINs from sendmsg.
>> 
>> But this is exactly what happens. Corosync calls sendmsg for all
>> active udpu members and then returns to the main loop -> epoll_wait.
>> 
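
To make the pattern described above concrete, here is a minimal Python sketch
of "one sendmsg attempt per member, treat EAGAIN as non-critical, then go back
to waiting". It is not corosync's code: the function names, the member list
and the use of the selectors module (which uses epoll on Linux) are
assumptions made for illustration only.

```python
import selectors
import socket


def send_to_members(sock, payload, members):
    """Try exactly one send per member; return the members that hit EAGAIN."""
    deferred = []
    for addr in members:
        try:
            sock.sendto(payload, addr)
        except BlockingIOError:
            # EAGAIN / EWOULDBLOCK: non-critical, so do not retry in a tight
            # loop. Remember the member and let the caller return to its wait.
            deferred.append(addr)
    return deferred


def wait_and_send(sel, sock, payload, members, timeout=1.0):
    """One main-loop pass: block in the selector, then send once to everyone."""
    if not sel.select(timeout):
        # Timed out without the socket becoming writable: nothing was sent.
        return list(members)
    return send_to_members(sock, payload, members)


def make_nonblocking_udp_socket(sel):
    """Hypothetical helper: a non-blocking UDP socket registered for writes."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setblocking(False)
    sel.register(sock, selectors.EVENT_WRITE)
    return sock
```

The point of the sketch is only that a failed send is recorded rather than
retried, so the process spends its time blocked in the wait call instead of
spinning.
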
>>> (although this seems different from the original bug where it got stuck
>>> in epoll_wait)
>> 
>> I'm pretty sure it is.
>> 
>> Anyway, let's try the "sched_yield" idea. Could you please try the
>> included patch and see if it makes any difference (only for udpu)?
> 
> Thanks for the patch. Unfortunately corosync still spins at 106% even
> with the yield:
> https://clbin.com/CF64x 
> 
> On another host corosync failed to start up completely (Denied
> connection not ready), and:
> https://clbin.com/Z35Gl 
> (I don't think this is related to the patch, it was doing that before
> when I looked at it this morning, kernel 4.20.0 this time)

I wonder: is it possible to run "iftop" and "top" (with a short refresh
interval, showing all threads and CPUs) while waiting for the problem to occur?
If I understand correctly, all the other terminals should freeze, so you'll
have plenty of time to snapshot the output ;-) I expect that either the network
load on the interface will be close to 100%, or the CPU handling the traffic
will be busy running corosync.
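
If an interactive "top" is hard to capture, the same per-thread picture can be
taken from /proc. The following is a rough Python sketch, assuming Linux and
the field layout of /proc/<pid>/task/<tid>/stat as documented in proc(5); the
function names are made up for this example.

```python
from pathlib import Path


def parse_task_stat(line):
    """Parse one /proc/<pid>/task/<tid>/stat line into a small dict."""
    # comm may contain spaces or parentheses, so split at the LAST ')'.
    head, _, tail = line.rpartition(")")
    tid_str, _, comm = head.partition("(")
    fields = tail.split()  # fields[0] is field 3 (state) in proc(5) numbering
    return {
        "tid": int(tid_str),
        "comm": comm,
        "state": fields[0],
        "cpu_ticks": int(fields[11]) + int(fields[12]),  # utime + stime
        "last_cpu": int(fields[36]),     # field 39: CPU last executed on
        "rt_priority": int(fields[37]),  # field 40
        "policy": int(fields[38]),       # field 41: 0=OTHER, 1=FIFO, 2=RR
    }


def snapshot(pid, proc_root="/proc"):
    """Return {tid: info} for every thread of `pid` that can be read."""
    tasks = {}
    for stat_path in Path(proc_root, str(pid), "task").glob("*/stat"):
        try:
            info = parse_task_stat(stat_path.read_text())
        except (OSError, ValueError, IndexError):
            continue  # thread exited or line was unparsable; skip it
        tasks[info["tid"]] = info
    return tasks


def busiest(before, after):
    """Sort threads by CPU ticks consumed between two snapshots, highest first."""
    deltas = [
        (after[tid]["cpu_ticks"] - before[tid]["cpu_ticks"], after[tid])
        for tid in after
        if tid in before
    ]
    return sorted(deltas, key=lambda item: item[0], reverse=True)


# Usage idea: a = snapshot(pid); sleep a second; b = snapshot(pid);
# then print busiest(a, b) to see which thread burns CPU and on which core.
```

Run against the corosync PID from a shell that is itself at realtime priority
(otherwise it may freeze like the other terminals), this would show which
thread is spinning, on which CPU, and with which scheduling policy.
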

> 
> Best regards,
> --Edwin
> 
>> 
>> Regards,
>>   Honza
>> 
>>>
>>>> Does the kernel log anything in that situation?
>>>
>>> Other than the crmd segfault, no.
>>> From previous observations on xenserver, the softirqs were all stuck on
>>> the CPU that corosync hogged at 100% (I'll check this on upstream, but I'm
>>> fairly sure it'll be the same). softirqs do not run at realtime priority
>>> (if we increase the priority of ksoftirqd to realtime then it all gets
>>> unstuck), but they seem to be essential for whatever corosync is stuck
>>> waiting on, in this case likely the sending/receiving of network packets.
>>>
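
For anyone wanting to repeat the ksoftirqd experiment mentioned above, a rough
Python sketch follows. It assumes Linux, that the per-CPU threads are named
"ksoftirqd/<n>" in /proc/<pid>/comm, and picks SCHED_FIFO at priority 1
arbitrarily: the thread does not say which realtime policy or priority was
used. The function names are hypothetical, and changing the policy needs root.

```python
import os
from pathlib import Path


def find_pids_by_comm(prefix, proc_root="/proc"):
    """Return sorted PIDs whose /proc/<pid>/comm starts with `prefix`."""
    pids = []
    for comm_path in Path(proc_root).glob("*/comm"):
        pid_name = comm_path.parent.name
        if not pid_name.isdigit():
            continue  # skip entries such as /proc/self
        try:
            comm = comm_path.read_text().strip()
        except OSError:
            continue  # process exited while we were scanning
        if comm.startswith(prefix):
            pids.append(int(pid_name))
    return sorted(pids)


def make_realtime(pids, priority=1):
    """Switch each PID to SCHED_FIFO at `priority` (assumed values, see above)."""
    for pid in pids:
        os.sched_setscheduler(pid, os.SCHED_FIFO, os.sched_param(priority))


# Usage idea (as root, for diagnosis only):
#   make_realtime(find_pids_by_comm("ksoftirqd/"))
```

If corosync gets unstuck after this, that would support the theory that a
realtime corosync thread is starving the softirq processing it depends on.
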
>>> I'm trying to narrow down the kernel between 4.19.16 and 4.20.10 to see
>>> why this was only reproducible on 4.19 so far.
>>>
>>> Best regards,
>>> --Edwin
>>>
>>>
>>>
>>> _______________________________________________
>>> Users mailing list: Users at clusterlabs.org 
>>> https://lists.clusterlabs.org/mailman/listinfo/users 
>>>
>>> Project Home: http://www.clusterlabs.org 
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>>> Bugs: http://bugs.clusterlabs.org 
>>>
>> 