[ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

Mon Feb 18 10:49:40 EST 2019

On 02/18/2019 04:15 PM, Christine Caulfield wrote:
> On 15/02/2019 16:58, Edwin Török wrote:
>> On 15/02/2019 16:08, Christine Caulfield wrote:
>>> On 15/02/2019 13:06, Edwin Török wrote:
>>>> I tried again with 'debug: trace', lots of process pause here:
>>>> https://clbin.com/ZUHpd
>>>>
>>>> And here is an strace running realtime prio 99, a LOT of epoll_wait and
>>>> sendmsg (gz format):
>>>> https://clbin.com/JINiV
>>>>
>>>> It detects large numbers of members left, but I think this is because
>>>> the corosync on those hosts got similarly stuck:
>>>> Feb 15 12:51:07 localhost corosync[29278]:  [TOTEM ] A new membership
>>>> (10.62.161.158:3152) was formed. Members left: 2 14 3 9 5 11 4 12 8 13 7
>>>> 1 10
>>>> Feb 15 12:51:07 localhost corosync[29278]:  [TOTEM ] Failed to receive
>>>> the leave message. failed: 2 14 3 9 5 11 4 12 8 13 7 1 10
>>>>
>>>> Looking at another host where corosync is still stuck at 100%, it says:
>>>> https://clbin.com/6UOn6
>>>>
>>> Thanks, that's really quite odd. I have vague recollections of a problem
>>> where corosync was spinning on epoll without reading anything but can't
>>> find the details at the moment, annoying.
>>>
>>> Some things you might be able to try that might help.
>>>
>>> 1) Is it possible to run without sbd? Sometimes too much polling from
>>> clients can cause odd behaviour

The results without sbd might be especially interesting in light of
the issue being triggered via config reloads. Sbd has callbacks
registered (RR at 99 as well) that are kicked off by config reloads too.
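
To check which processes actually run as RR at 99 on an affected host, something along these lines might help. It is only a minimal sketch, assuming Linux and Python 3; the helper names describe_scheduling and is_realtime are made up for illustration, and the PID of corosync or sbd has to be looked up separately.

```python
import os

# Map Linux scheduling policy constants to readable names.
POLICY_NAMES = {
    os.SCHED_OTHER: "SCHED_OTHER",
    os.SCHED_FIFO: "SCHED_FIFO",
    os.SCHED_RR: "SCHED_RR",
    os.SCHED_BATCH: "SCHED_BATCH",
    os.SCHED_IDLE: "SCHED_IDLE",
}


def describe_scheduling(pid=0):
    """Return (policy_name, rt_priority) for pid; pid 0 means the caller."""
    policy = os.sched_getscheduler(pid)
    # SCHED_RESET_ON_FORK may be OR-ed into the returned policy; mask it off.
    policy &= ~os.SCHED_RESET_ON_FORK
    priority = os.sched_getparam(pid).sched_priority
    return POLICY_NAMES.get(policy, "unknown(%d)" % policy), priority


def is_realtime(pid=0):
    """True if pid runs under one of the realtime policies (FIFO or RR)."""
    name, _ = describe_scheduling(pid)
    return name in ("SCHED_FIFO", "SCHED_RR")


if __name__ == "__main__":
    # Inspect the current process; pass a real PID to inspect another one.
    print(describe_scheduling())
```

Calling describe_scheduling() with the PID of corosync and of sbd would show whether both end up competing at the same realtime priority.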

>>> 2) is it possible to try with a different kernel? We've tried a vanilla
>>> 4.19 and it's fine, but not with the Xen patches obviously
>> I'll try with some bare-metal upstream distros and report back the repro
>> steps if I can get it to reliably repro, hopefully early next week, it
>> is unlikely I'll get a working repro today.
>>
>>> 3) Does running corosync with the -p option help?
>> Yes, with "-p" I was able to run cluster create/GFS2 plug/unplug/destroy
>> on 16 physical hosts in a loop for an hour without any crashes (previously
>> it would crash within minutes).
>>
>> I found another workaround too:
>> echo NO_RT_RUNTIME_SHARE >/sys/kernel/debug/sched_features
>>
>> This makes the 95% realtime process CPU limit from
>> sched_rt_runtime_us/sched_rt_period_us apply per core, instead of
>> globally, so there would be 5% time left for non-realtime tasks on each
>> core. Seems to be enough to avoid the livelock, I was not able to
>> observe corosync using high CPU % anymore.
>> Still got more tests to run on this over the weekend, but looks promising.
>>
>> This is a safety layer of course, to prevent the system from fencing if
>> we encounter high CPU usage in corosync/libq. I am still interested in
>> tracking down the corosync/libq issue as it shouldn't have happened in
>> the first place.
>>
> That's helpful to know. Does corosync still use lots of CPU time in this
> situation (without RT) or does it behave normally?

I'd expect the high load to come from some kind of busy-waiting (hidden
behind whatever complexity) on something that never happens because
whatever would make it happen is not scheduled. So under these other
scheduler conditions I would expect at most a short spike until the
scheduler kicks in.
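
For reference, the 95% budget mentioned above can be read back from procfs. The following is a rough sketch, assuming Linux and Python 3; the function names are made up for illustration. It only reports the configured budget and does not check whether RT_RUNTIME_SHARE is set, since that lives in debugfs and needs root.

```python
from pathlib import Path

# Location of the realtime throttling knobs on Linux.
PROC = Path("/proc/sys/kernel")


def rt_budget_fraction(runtime_us, period_us):
    """Fraction of each period that realtime tasks may consume.

    A runtime of -1 disables throttling, so realtime tasks may use it all.
    """
    if runtime_us < 0:
        return 1.0
    return runtime_us / period_us


def read_rt_budget(proc=PROC):
    """Read (sched_rt_runtime_us, sched_rt_period_us) as integers."""
    runtime = int((proc / "sched_rt_runtime_us").read_text())
    period = int((proc / "sched_rt_period_us").read_text())
    return runtime, period


if __name__ == "__main__":
    runtime, period = read_rt_budget()
    frac = rt_budget_fraction(runtime, period)
    print("realtime budget: {:.0%}, left for other tasks: {:.0%}".format(
        frac, 1 - frac))
```

With the usual kernel defaults of 950000 and 1000000 the fraction comes out at 0.95, which leaves the 5% for non-realtime tasks that was described above.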

>
>>> Is there any situation where this has worked? either with different
>>> components or different corosync.conf files?
>>>
>>> Also, and I don't think this is directly related to the issue, but I can
>>> see configuration reloads happening from 2 nodes every 5 seconds. It's
>>> very odd and maybe not what you want!
>> The configuration reloads are a way of triggering this bug reliably, I
>> should've mentioned that earlier
>> (the problem happens during a configuration reload, but not always, and
>> by doing configuration reloads in a loop that just add/remove one node
>> the problem can be triggered reliably within minutes).
>>
>>
> I've been trying this on my (KVM) virtual machines today but I can't
> reproduce it on a Standard RHEL-7, so I'm interested to see how you get
> on with a different kernel.
>
> Chrissie