[ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

Klaus Wenninger kwenning at redhat.com
Tue Feb 19 17:54:33 UTC 2019


On 02/19/2019 06:21 PM, Edwin Török wrote:
>
> On 19/02/2019 17:02, Klaus Wenninger wrote:
>> On 02/19/2019 05:41 PM, Edwin Török wrote:
>>> On 19/02/2019 16:26, Edwin Török wrote:
>>>> On 18/02/2019 18:27, Edwin Török wrote:
>>>>> Did a test today with CentOS 7.6 with upstream kernel and with
>>>>> 4.20.10-1.el7.elrepo.x86_64 (tested both with upstream SBD, and our
>>>>> patched [1] SBD) and was not able to reproduce the issue yet.
>>>> I was finally able to reproduce this using only upstream components
>>>> (it seems easier to reproduce with our patched SBD, but I was able
>>>> to reproduce it with upstream packages unpatched by us):
>> Just out of curiosity: What did you patch in SBD?
>> Sorry if I missed the answer in the previous communication.
> It is mostly this PR, which calls getquorate quite often (a more
> efficient implementation would use the quorum notification API, as
> dlm/pacemaker do, although see the concerns in
> https://lists.clusterlabs.org/pipermail/users/2019-February/016249.html):
> https://github.com/ClusterLabs/sbd/pull/27

Ooh yes totally forgotten about that ... bad conscience ...
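
(For readers following along: the difference between the PR's polling
and the notification API is roughly the following. This is only a sketch
against libquorum's corosync/quorum.h; it assumes a running corosync,
and the exact names and signatures should be checked against the
installed headers.)

```
#include <corosync/corotypes.h>
#include <corosync/quorum.h>

/* Polling, as in the SBD PR: ask corosync for the quorum state on
 * every servant tick. */
static int poll_quorate(quorum_handle_t h)
{
    int quorate = 0;
    if (quorum_getquorate(h, &quorate) != CS_OK)
        return -1;
    return quorate;
}

/* Event-driven, as dlm/pacemaker do: register a callback and let
 * corosync push quorum/membership changes to us. */
static void on_quorum(quorum_handle_t h, uint32_t quorate,
                      uint64_t ring_seq, uint32_t view_list_entries,
                      uint32_t *view_list)
{
    /* react to quorum changes here */
}

static int track_quorum(void)
{
    quorum_handle_t h;
    uint32_t quorum_type;
    quorum_callbacks_t cb = { .quorum_notify_fn = on_quorum };

    if (quorum_initialize(&h, &cb, &quorum_type) != CS_OK)
        return -1;
    if (quorum_trackstart(h, CS_TRACK_CHANGES) != CS_OK)
        return -1;
    /* Then pump callbacks from the main loop, e.g. by polling the fd
     * from quorum_fd_get() and calling quorum_dispatch(h,
     * CS_DISPATCH_ALL) when it becomes readable. */
    return 0;
}
```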

>
> We have also added our own servant for watching the health of our
> control plane, but that is not relevant to this bug (it reproduces with
> that watcher turned off too).
>
>>> I was also able to get a corosync blackbox from one of the stuck VMs
>>> that showed something interesting:
>>> https://clbin.com/d76Ha
>>>
>>> It is looping on:
>>> debug   Feb 19 16:37:24 mcast_sendmsg(408):12: sendmsg(mcast) failed
>>> (non-critical): Resource temporarily unavailable (11)
>> Hmm ... something like the tx-queue of the device being full, or no
>> buffers available anymore and the kernel thread doing the cleanup not
>> getting scheduled ...
> Yes, that is very plausible. Perhaps it would be nicer if corosync went
> back to the epoll_wait loop when it gets too many EAGAINs from sendmsg
> (although this seems different from the original bug, where it got
> stuck in epoll_wait).
>
>> Does the kernel log anything in that situation?
> Other than the crmd segfault, no.
> From previous observations on xenserver, the softirqs were all stuck on
> the CPU that corosync hogged 100% (I'll check this on upstream, but I'm
> fairly sure it'll be the same). softirqs do not run at realtime priority
> (if we increase the priority of ksoftirqd to realtime, everything gets
> unstuck), yet they seem to be essential for whatever corosync is stuck
> waiting on, in this case likely the sending/receiving of network packets.
>
> I'm trying to narrow down the kernel versions between 4.19.16 and
> 4.20.10 to see why this has only been reproducible on 4.19 so far.

Maybe an issue with how that kernel distributes the load over cores ...
Can you provoke it by trying on a single core, or by pinning
ksoftirqd and corosync to the same core?
Just unfortunate that this is the LTS ...

>
> Best regards,
> --Edwin


