[ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

Jan Friesse jfriesse at redhat.com
Wed Feb 20 08:08:59 EST 2019


Edwin Török wrote:
> On 20/02/2019 07:57, Jan Friesse wrote:
>> Edwin,
>>>
>>>
>>> On 19/02/2019 17:02, Klaus Wenninger wrote:
>>>> On 02/19/2019 05:41 PM, Edwin Török wrote:
>>>>> On 19/02/2019 16:26, Edwin Török wrote:
>>>>>> On 18/02/2019 18:27, Edwin Török wrote:
>>>>>>> Did a test today with CentOS 7.6 with upstream kernel and with
>>>>>>> 4.20.10-1.el7.elrepo.x86_64 (tested both with upstream SBD, and our
>>>>>>> patched [1] SBD) and was not able to reproduce the issue yet.
>>>>>> I was finally able to reproduce this using only upstream components
>>>>>> (it seems easier to reproduce with our patched SBD, but I was able
>>>>>> to reproduce it using only upstream packages, unpatched by us):
>>>>
>>>> Just out of curiosity: What did you patch in SBD?
>>>> Sorry if I missed the answer in the previous communication.
>>>
>>> It is mostly this PR, which calls getquorate quite often (a more
>>> efficient implementation would be to use the quorum notification API,
>>> as dlm/pacemaker do, although see the concerns in
>>> https://lists.clusterlabs.org/pipermail/users/2019-February/016249.html):
>>> https://github.com/ClusterLabs/sbd/pull/27
>>>
>>> We have also added our own servant for watching the health of our
>>> control plane, but that is not relevant to this bug (it reproduces with
>>> that watcher turned off too).
>>>
>>>>
>>>>> I was also able to get a corosync blackbox from one of the stuck VMs
>>>>> that showed something interesting:
>>>>> https://clbin.com/d76Ha
>>>>>
>>>>> It is looping on:
>>>>> debug   Feb 19 16:37:24 mcast_sendmsg(408):12: sendmsg(mcast) failed
>>>>> (non-critical): Resource temporarily unavailable (11)
>>>>
>>>> Hmm ... something like the device's tx queue being full, or no
>>>> buffers available anymore while the kernel thread doing the cleanup
>>>> isn't being scheduled ...
>>>
>>> Yes, that is very plausible. Perhaps it'd be nicer if corosync went
>>> back to the epoll_wait loop when it gets too many EAGAINs from sendmsg.
>>
>> But this is exactly what happens: corosync calls sendmsg for all
>> active udpu members and then returns to the main loop -> epoll_wait.
>>
>>> (although this seems different from the original bug where it got stuck
>>> in epoll_wait)
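
[Side note: "returns to the main loop" above means the udpu transmit
path is essentially a non-blocking, fire-and-forget sendmsg per member,
roughly like the sketch below. This is an illustration of the idea, not
the actual totemudpu.c code.]

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send one message to one udpu member. The socket is non-blocking, so a
 * full tx queue shows up as EAGAIN; it is only logged as non-critical
 * and control goes straight back to the epoll_wait loop, relying on
 * totem's retransmission to recover the lost packet. */
void send_to_member(int fd, const struct sockaddr *addr, socklen_t addrlen,
    const void *buf, size_t len)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	struct msghdr msg = {
		.msg_name = (void *)addr,
		.msg_namelen = addrlen,
		.msg_iov = &iov,
		.msg_iovlen = 1,
	};

	if (sendmsg(fd, &msg, MSG_NOSIGNAL | MSG_DONTWAIT) < 0) {
		if (errno == EAGAIN || errno == EWOULDBLOCK) {
			fprintf(stderr,
			    "sendmsg(mcast) failed (non-critical): %s (%d)\n",
			    strerror(errno), errno);
			return;	/* do not block or spin here */
		}
		/* other errors would be logged/handled here */
	}
}
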
>>
>> I'm pretty sure it is.
>>
>> Anyway, let's try the "sched_yield" idea. Could you please try the
>> included patch and see if it makes any difference (it only affects udpu)?
> 
> Thanks for the patch; unfortunately corosync still spins at 106% even
> with the yield:
> https://clbin.com/CF64x

Yep, it was kind of expected, but it was at least worth a try. How does
strace look when this happens?

Also, Klaus had an idea: take sbd out of the picture and try a
different RR process to find out what happens. I think that's also
worth a try.

Could you please install/enable/start
https://github.com/jfriesse/spausedd (packages built by COPR are at
https://copr.fedorainfracloud.org/coprs/honzaf/spausedd/),
disable/remove sbd, and run your test?
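
The idea behind spausedd, roughly: a tiny SCHED_RR process that just
sleeps for a fixed interval and complains when it wakes up much later
than requested, i.e. when it was not scheduled. Very rough sketch of
that idea only - see the spausedd sources for what it actually does:

#include <sched.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define PERIOD_MS	200
#define MAX_LATE_MS	100	/* complain if we wake up this much late */

static long long now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

int main(void)
{
	struct sched_param param = { .sched_priority = 99 };
	long long before, after;

	/* Run in the same scheduling class corosync/sbd use. */
	if (sched_setscheduler(0, SCHED_RR, &param) != 0)
		perror("sched_setscheduler");

	for (;;) {
		before = now_ms();
		usleep(PERIOD_MS * 1000);
		after = now_ms();
		if (after - before > PERIOD_MS + MAX_LATE_MS)
			printf("not scheduled for %lld ms\n", after - before);
	}
}

If spausedd reports pauses even with sbd removed, that would point at
the kernel/scheduler side rather than at anything sbd itself does.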

> 
> On another host corosync failed to start up completely ("Denied
> connection not ready"), and:
> https://clbin.com/Z35Gl
> (I don't think this is related to the patch; it was doing that before,
> when I looked at it this morning, on kernel 4.20.0 this time.)

This one looks kind of normal and I'm pretty sure it's unrelated (I've
seen it before; sadly, I was never able to find a "reliable" reproducer).

Regards,
   Honza

> 
> Best regards,
> --Edwin
> 
>>
>> Regards,
>>    Honza
>>
>>>
>>>> Does the kernel log anything in that situation?
>>>
>>> Other than the crmd segfault, no.
>>> From previous observations on xenserver, the softirqs were all stuck
>>> on the CPU that corosync hogged at 100% (I'll check this on upstream,
>>> but I'm fairly sure it'll be the same). softirqs do not run at realtime
>>> priority (if we increase the priority of ksoftirqd to realtime, then it
>>> all gets unstuck), but they seem to be essential for whatever corosync
>>> is stuck waiting on, in this case likely the sending/receiving of
>>> network packets.
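
[Side note: the "increase the priority of ksoftirqd to realtime"
experiment is basically what "chrt -f -p <prio> <pid-of-ksoftirqd/N>"
does; a minimal C equivalent is sketched below. The priority and class
chosen here are illustrative guesses - this is the diagnostic hack that
confirms the priority-inversion theory, not a fix.]

#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
	/* Illustrative values only: the point is just moving the
	 * ksoftirqd thread out of SCHED_OTHER into a realtime class. */
	struct sched_param param = { .sched_priority = 2 };
	pid_t pid;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <ksoftirqd-pid>\n", argv[0]);
		return 1;
	}
	pid = (pid_t)atoi(argv[1]);

	if (sched_setscheduler(pid, SCHED_FIFO, &param) != 0) {
		perror("sched_setscheduler");
		return 1;
	}
	return 0;
}
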
>>>
>>> I'm trying to narrow down the kernel versions between 4.19.16 and
>>> 4.20.10 to see why this has only been reproducible on 4.19 so far.
>>>
>>> Best regards,
>>> --Edwin
>>



