[ClusterLabs] Antw: Re: corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

Mon Feb 18 02:56:43 EST 2019

Hi!

I also wonder: with SCHED_RR, would a sched_yield() at a suitable place in
the loop that uses 100% CPU also fix this issue? Or do you think "we need
real-time and cannot allow any other task to run"?
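
To make the question concrete, here is a minimal sketch of what I mean. It is
not corosync code: spin_with_yield() and countdown_poll() are hypothetical
names, and the callback only stands in for one pass of the real event loop.
One caveat I am aware of: under SCHED_RR, sched_yield() only moves the caller
to the end of the run list for its own priority, so if no other runnable task
has the same or a higher real-time priority the call returns at once and
ordinary (non-realtime) tasks still do not get the CPU.

```c
#include <sched.h>

/* Hypothetical stand-in for one pass of a busy event loop.
 * Returns the number of events handled; 0 means there was nothing to do. */
typedef int (*poll_fn)(void *ctx);

/* Run max_iters passes and call sched_yield() after every pass that did
 * no work. Returns how many times the loop yielded. */
int spin_with_yield(poll_fn poll_once, void *ctx, int max_iters)
{
    int yields = 0;

    for (int i = 0; i < max_iters; i++) {
        if (poll_once(ctx) == 0) {
            /* Only helps tasks of equal or higher RT priority. */
            sched_yield();
            yields++;
        }
    }
    return yields;
}

/* Example callback: reports work until the counter in ctx reaches zero. */
int countdown_poll(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return 1;
    }
    return 0;
}
```

So a yield like this might help if the starved task is itself a real-time
task at the same priority, but probably not if it is a normal task.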

Regards,
Ulrich

>>> Edwin Török <edvin.torok at citrix.com> wrote on 15.02.2019 at 17:58 in
message <a10652a1-b769-be24-ecb4-5b7efbe9d199 at citrix.com>:
> On 15/02/2019 16:08, Christine Caulfield wrote:
>> On 15/02/2019 13:06, Edwin Török wrote:
>>> I tried again with 'debug: trace', lots of process pause here:
>>> https://clbin.com/ZUHpd 
>>>
>>> And here is an strace running realtime prio 99, a LOT of epoll_wait and
>>> sendmsg (gz format):
>>> https://clbin.com/JINiV 
>>>
>>> It detects large numbers of members left, but I think this is because
>>> the corosync on those hosts got similarly stuck:
>>> Feb 15 12:51:07 localhost corosync[29278]:  [TOTEM ] A new membership
>>> (10.62.161.158:3152) was formed. Members left: 2 14 3 9 5 11 4 12 8 13 7
>>> 1 10
>>> Feb 15 12:51:07 localhost corosync[29278]:  [TOTEM ] Failed to receive
>>> the leave message. failed: 2 14 3 9 5 11 4 12 8 13 7 1 10
>>>
>>> Looking on another host that is still stuck 100% corosync it says:
>>> https://clbin.com/6UOn6 
>>>
>> 
>> Thanks, that's really quite odd. I have vague recollections of a problem
>> where corosync was spinning on epoll without reading anything but can't
>> find the details at the moment, annoying.
>> 
>> Some things you could try that might help:
>> 
>> 1) Is it possible to run without sbd? Sometimes too much polling from
>> clients can cause odd behaviour.
>> 2) Is it possible to try with a different kernel? We've tried a vanilla
>> 4.19 and it's fine, but not with the Xen patches, obviously.
> 
> I'll try with some bare-metal upstream distros and report back the repro
> steps if I can get it to reproduce reliably, hopefully early next week. It
> is unlikely I'll get a working repro today.
> 
>> 3) Does running corosync with the -p option help?
> 
> Yes, with "-p" I was able to run cluster create/GFS2 plug/unplug/destroy
> on 16 physical hosts in a loop for an hour without any crashes (previously
> it would crash within minutes).
> 
> I found another workaround too:
> echo NO_RT_RUNTIME_SHARE >/sys/kernel/debug/sched_features
> 
> This makes the 95% realtime CPU limit from
> sched_rt_runtime_us/sched_rt_period_us apply per core instead of
> globally, so 5% of the time on each core is left for non-realtime tasks.
> That seems to be enough to avoid the livelock: I was no longer able to
> observe corosync using high CPU %.
> I still have more tests to run on this over the weekend, but it looks
> promising.
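
As a side note on the numbers quoted above: the share left for non-realtime
tasks can be worked out from the two sysctl values. The sketch below is only
an illustration; the function names are made up, and it assumes the values
live in /proc/sys/kernel/sched_rt_runtime_us and
/proc/sys/kernel/sched_rt_period_us, where a runtime of -1 means the limit is
disabled.

```c
#include <stdio.h>

/* Read one integer from a file. Returns 0 on success, -1 on failure. */
int read_long(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Percentage of each period that realtime tasks may not use.
 * A negative runtime (-1) means throttling is off, so nothing is reserved. */
double non_rt_reserve_percent(long runtime_us, long period_us)
{
    if (runtime_us < 0 || period_us <= 0 || runtime_us >= period_us)
        return 0.0;
    return 100.0 * (double)(period_us - runtime_us) / (double)period_us;
}

/* Print the reserve configured on the running system, if readable. */
void print_rt_reserve(void)
{
    long runtime_us, period_us;

    if (read_long("/proc/sys/kernel/sched_rt_runtime_us", &runtime_us) != 0 ||
        read_long("/proc/sys/kernel/sched_rt_period_us", &period_us) != 0) {
        printf("could not read the RT throttling settings\n");
        return;
    }
    printf("reserved for non-realtime tasks: %.1f%%\n",
           non_rt_reserve_percent(runtime_us, period_us));
}
```

With a runtime of 950000 and a period of 1000000 this gives the 5% mentioned
above. Whether that reserve applies per core or globally is what the
NO_RT_RUNTIME_SHARE setting is reported to change; the code does not check
that.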
> 
> This is a safety layer of course, to prevent the system from fencing if
> we encounter high CPU usage in corosync/libqb. I am still interested in
> tracking down the corosync/libqb issue, as it shouldn't have happened in
> the first place.
> 
>> 
>> Is there any situation where this has worked, either with different
>> components or different corosync.conf files?
>> 
>> Also, and I don't think this is directly related to the issue, but I can
>> see configuration reloads happening from 2 nodes every 5 seconds. It's
>> very odd and maybe not what you want!
> 
> The configuration reloads are a way of triggering this bug reliably; I
> should've mentioned that earlier. The problem happens during a
> configuration reload, but not always, and doing configuration reloads in
> a loop that just adds/removes one node triggers it reliably within
> minutes.
> 
> 
> Best regards,
> --Edwin