[ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

Edwin Török edvin.torok at citrix.com
Wed Feb 20 06:30:42 EST 2019


On 20/02/2019 07:57, Jan Friesse wrote:
> Edwin,
>>
>>
>> On 19/02/2019 17:02, Klaus Wenninger wrote:
>>> On 02/19/2019 05:41 PM, Edwin Török wrote:
>>>> On 19/02/2019 16:26, Edwin Török wrote:
>>>>> On 18/02/2019 18:27, Edwin Török wrote:
>>>>>> Did a test today with CentOS 7.6 with upstream kernel and with
>>>>>> 4.20.10-1.el7.elrepo.x86_64 (tested both with upstream SBD, and our
>>>>>> patched [1] SBD) and was not able to reproduce the issue yet.
>>>>> I was finally able to reproduce this using only upstream components
>>>>> (it seems easier to reproduce with our patched SBD, but I could also
>>>>> reproduce it with upstream packages unpatched by us):
>>>
>>> Just out of curiosity: What did you patch in SBD?
>>> Sorry if I missed the answer in the previous communication.
>>
>> It is mostly this PR, which calls getquorate quite often (a more
>> efficient impl. would be to use the quorum notification API like
>> dlm/pacemaker do, although see concerns in
>> https://lists.clusterlabs.org/pipermail/users/2019-February/016249.html):
>> https://github.com/ClusterLabs/sbd/pull/27
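
For reference, a rough sketch of what the notification-based approach
looks like against corosync's libquorum (corosync/quorum.h, 2.x API).
This is illustrative only -- not the code from the PR, nor what
dlm/pacemaker actually do:

    #include <stdint.h>
    #include <stdio.h>
    #include <corosync/corotypes.h>
    #include <corosync/quorum.h>

    /* Called by libquorum whenever the quorum state changes,
     * instead of polling quorum_getquorate() on a timer. */
    static void quorum_notify(quorum_handle_t handle, uint32_t quorate,
                              uint64_t ring_id, uint32_t view_list_entries,
                              uint32_t *view_list)
    {
            (void)handle; (void)ring_id; (void)view_list; /* unused here */
            printf("quorate=%u, %u members in view\n",
                   (unsigned)quorate, (unsigned)view_list_entries);
    }

    static int track_quorum(quorum_handle_t *handle)
    {
            quorum_callbacks_t cb = { .quorum_notify_fn = quorum_notify };
            uint32_t quorum_type;

            if (quorum_initialize(handle, &cb, &quorum_type) != CS_OK)
                    return -1;
            if (quorum_trackstart(*handle, CS_TRACK_CHANGES) != CS_OK) {
                    quorum_finalize(*handle);
                    return -1;
            }
            /* A real daemon would add quorum_fd_get()'s descriptor to its
             * poll loop and call quorum_dispatch(*handle, CS_DISPATCH_ALL)
             * when it becomes readable. */
            return 0;
    }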
>>
>> We have also added our own servant for watching the health of our
>> control plane, but that is not relevant to this bug (it reproduces with
>> that watcher turned off too).
>>
>>>
>>>> I was also able to get a corosync blackbox from one of the stuck VMs
>>>> that showed something interesting:
>>>> https://clbin.com/d76Ha
>>>>
>>>> It is looping on:
>>>> debug   Feb 19 16:37:24 mcast_sendmsg(408):12: sendmsg(mcast) failed
>>>> (non-critical): Resource temporarily unavailable (11)
>>>
>>> Hmm ... something like the tx-queue of the device being full, or no
>>> buffers available anymore and the kernel thread doing the cleanup not
>>> getting scheduled ...
>>
>> Yes that is very plausible. Perhaps it'd be nicer if corosync went back
>> to the epoll_wait loop when it gets too many EAGAINs from sendmsg.
> 
> But this is exactly what happens. Corosync calls sendmsg to all active
> udpu members and then returns to the main loop -> epoll_wait.
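
(For anyone following along, the pattern Honza describes is roughly the
sketch below: a non-blocking send pass over every udpu member, with
EAGAIN logged as "non-critical" and control then going back to the
epoll_wait loop. Illustrative only, not corosync's actual code;
"struct member" is a made-up placeholder for the per-node address
record.)

    #include <errno.h>
    #include <stddef.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    struct member {                      /* hypothetical member record */
            struct sockaddr_storage addr;
            socklen_t addrlen;
    };

    /* One datagram to every member over a non-blocking socket. */
    static void send_to_all_members(int fd, const struct member *m, size_t n,
                                    const void *buf, size_t len)
    {
            for (size_t i = 0; i < n; i++) {
                    struct iovec iov = {
                            .iov_base = (void *)buf,
                            .iov_len  = len,
                    };
                    struct msghdr msg = {
                            .msg_name    = (void *)&m[i].addr,
                            .msg_namelen = m[i].addrlen,
                            .msg_iov     = &iov,
                            .msg_iovlen  = 1,
                    };
                    if (sendmsg(fd, &msg, MSG_NOSIGNAL) < 0 &&
                        (errno == EAGAIN || errno == EWOULDBLOCK)) {
                            /* "failed (non-critical)": drop it and move
                             * on; the caller returns to epoll_wait. */
                            continue;
                    }
            }
    }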
> 
>> (although this seems different from the original bug where it got stuck
>> in epoll_wait)
> 
> I'm pretty sure it is.
> 
> Anyway, let's try "sched_yield" idea. Could you please try included
> patch and see if it makes any difference (only for udpu)?
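
(The patch itself isn't quoted here; the idea is roughly the sketch
below -- give up the CPU after a udpu send pass that hit EAGAIN, in the
hope that the kernel threads doing the cleanup get to run. Illustrative
only, not the actual patch. Note that per sched(7), sched_yield() from
a SCHED_RR thread only yields to threads of equal realtime priority, so
a SCHED_OTHER ksoftirqd would still not be scheduled.)

    #include <sched.h>
    #include <stdbool.h>

    /* Sketch: called after the send pass when at least one sendmsg()
     * returned EAGAIN. */
    static void maybe_yield_after_send_pass(bool hit_eagain)
    {
            if (hit_eagain) {
                    sched_yield();
            }
    }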

Thanks for the patch; unfortunately corosync still spins at 106% CPU even
with the yield:
https://clbin.com/CF64x

On another host corosync failed to complete startup (Denied connection
not ready), and:
https://clbin.com/Z35Gl
(I don't think this is related to the patch; it was already doing that
this morning when I looked at it, with kernel 4.20.0 this time.)

Best regards,
--Edwin

> 
> Regards,
>   Honza
> 
>>
>>> Does the kernel log anything in that situation?
>>
>> Other than the crmd segfault, no.
>> From previous observations on XenServer, the softirqs were all stuck on
>> the CPU that corosync hogged at 100% (I'll check this on upstream, but I'm
>> fairly sure it'll be the same). Softirqs do not run at realtime priority
>> (if we raise the priority of ksoftirqd to realtime then everything gets
>> unstuck), yet they seem to be essential for whatever corosync is stuck
>> waiting on, in this case likely the sending/receiving of network packets.
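
(For anyone reproducing this: the diagnostic workaround of raising
ksoftirqd to realtime can be done with e.g. "chrt -f -p <prio> <pid of
ksoftirqd/N>", or programmatically along the lines of the sketch below.
Illustrative only; the ksoftirqd pid has to be looked up per CPU, and
making ksoftirqd realtime is a debugging aid, not a recommended fix.)

    #include <sched.h>
    #include <stdio.h>
    #include <sys/types.h>

    /* Move a kernel thread (e.g. ksoftirqd/N) to SCHED_FIFO at the given
     * priority. To preempt a spinning SCHED_RR task it needs a priority
     * at or above that task's. */
    static int make_realtime(pid_t pid, int prio)
    {
            struct sched_param sp = { .sched_priority = prio };

            if (sched_setscheduler(pid, SCHED_FIFO, &sp) != 0) {
                    perror("sched_setscheduler");
                    return -1;
            }
            return 0;
    }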
>>
>> I'm trying to narrow down the kernel changes between 4.19.16 and 4.20.10
>> to see why this has only been reproducible on 4.19 so far.
>>
>> Best regards,
>> --Edwin
>>
>>
>>
> 


