[ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

Edwin Török edvin.torok at citrix.com
Mon Feb 18 13:27:44 EST 2019



On 18/02/2019 15:49, Klaus Wenninger wrote:
> On 02/18/2019 04:15 PM, Christine Caulfield wrote:
>> On 15/02/2019 16:58, Edwin Török wrote:
>>> On 15/02/2019 16:08, Christine Caulfield wrote:
>>>> On 15/02/2019 13:06, Edwin Török wrote:
>>>>> I tried again with 'debug: trace'; lots of process pause messages here:
>>>>> https://clbin.com/ZUHpd
>>>>>
>>>>> And here is an strace while running at realtime prio 99, showing a LOT
>>>>> of epoll_wait and sendmsg calls (gz format):
>>>>> https://clbin.com/JINiV
>>>>>
>>>>> It detects a large number of members leaving, but I think this is
>>>>> because the corosync on those hosts got similarly stuck:
>>>>> Feb 15 12:51:07 localhost corosync[29278]:  [TOTEM ] A new membership
>>>>> (10.62.161.158:3152) was formed. Members left: 2 14 3 9 5 11 4 12 8 13 7
>>>>> 1 10
>>>>> Feb 15 12:51:07 localhost corosync[29278]:  [TOTEM ] Failed to receive
>>>>> the leave message. failed: 2 14 3 9 5 11 4 12 8 13 7 1 10
>>>>>
>>>>> Looking at another host where corosync is still stuck at 100% CPU, it
>>>>> says:
>>>>> https://clbin.com/6UOn6
>>>>>
>>>> Thanks, that's really quite odd. I have vague recollections of a problem
>>>> where corosync was spinning on epoll without reading anything, but I can't
>>>> find the details at the moment, which is annoying.
>>>>
>>>> Some things you might be able to try that might help:
>>>>
>>>> 1) Is it possible to run without sbd? Sometimes too much polling from
>>>> clients can cause odd behaviour.
> 
> The results without sbd might be especially interesting in light of
> the issue being triggered via config-reloads. Sbd has callbacks
> registered (RR at 99 as well) that are kicked off by config-reloads as
> well.

Did a test today on CentOS 7.6 with an upstream kernel,
4.20.10-1.el7.elrepo.x86_64 (tested both with upstream SBD and with our
patched [1] SBD), and was not able to reproduce the issue yet.
I'll keep trying to find out what it is in our environment that triggers
the bug (the next step would probably be to try the exact same kernel on
both CentOS 7.6 and XenServer).
Disabling SBD in the original XenServer environment also didn't show any
high CPU usage in corosync.

One difference I see is that there are far fewer quorum requests in the
corosync blackbox on an upstream CentOS 7.6, but that could be because
of [1], where we call quorum_getquorate every 1s, potentially putting
more load on corosync (and triggering the bug) by doing so.
A more efficient implementation could be to use the quorum tracking
callbacks like dlm and pacemaker do (roughly sketched below, after this
list), although I'm worried that doing that could end in a split brain
if:
* only watchdog based fencing is used in SBD
* cluster1 gets stuck in corosync spinning in epoll
* cluster1 loses network connectivity on the clustering network
* the other nodes in the cluster no longer see cluster1 as a member and
consider it fenced after a timeout
* cluster1 doesn't send the quorum notification because it is stuck in
the above loop, and DLM would still consider it quorate
* cluster1 continues to do GFS2 operations on the storage network/fiber
channel as if it still had quorum

However, upstream DLM and Pacemaker (and SBD through its pacemaker
inquisitor) use the quorum tracking callbacks, so maybe I am missing
something and they would actually be safe to use?

Nevertheless, if frequent querying of corosync is triggering this
issue, that might still be worth fixing.

[1] https://github.com/ClusterLabs/sbd/pull/27
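
For comparison, the per-second querying that [1] adds boils down to
roughly the following (again just a sketch, not the actual SBD code;
the servant/inquisitor plumbing and timeout handling are left out):

#include <stdio.h>
#include <unistd.h>
#include <corosync/corotypes.h>
#include <corosync/quorum.h>

int main(void)
{
    quorum_handle_t handle;
    uint32_t quorum_type;
    int quorate;

    /* No callbacks registered: we only ever poll the current value. */
    if (quorum_initialize(&handle, NULL, &quorum_type) != CS_OK)
        return 1;

    for (;;) {
        /* One synchronous IPC round-trip to corosync every second;
         * with one such client per node this is extra work for the
         * corosync main loop on top of the normal traffic. */
        if (quorum_getquorate(handle, &quorate) == CS_OK)
            printf("quorate=%d\n", quorate);
        sleep(1);
    }

    /* not reached */
    quorum_finalize(handle);
    return 0;
}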

> 
>>>> 2) Is it possible to try with a different kernel? We've tried a vanilla
>>>> 4.19 and it's fine, but not with the Xen patches obviously.
>>> I'll try with some bare-metal upstream distros and report back the repro
>>> steps if I can get it to reliably repro, hopefully early next week, it
>>> is unlikely I'll get a working repro today.
>>>
>>>> 3) Does running corosync with the -p option help?
>>> Yes, with "-p" I was able to run cluster create/GFS2 plug/unplug/destroy
>>> on 16 physical hosts in a loop for an hour without any crashes (previously
>>> it would crash within minutes).
>>>
>>> I found another workaround too:
>>> echo NO_RT_RUNTIME_SHARE >/sys/kernel/debug/sched_features
>>>
>>> This makes the 95% realtime process CPU limit from
>>> sched_rt_runtime_us/sched_rt_period_us apply per core, instead of
>>> globally, so there would be 5% time left for non-realtime tasks on each
>>> core. This seems to be enough to avoid the livelock; I was not able to
>>> observe corosync using a high CPU % anymore.
>>> Still got more tests to run on this over the weekend, but looks promising.
>>>
>>> This is of course just a safety layer, to prevent the system from fencing
>>> if we encounter high CPU usage in corosync/libqb. I am still interested in
>>> tracking down the corosync/libqb issue as it shouldn't have happened in
>>> the first place.
>>>
>> That's helpful to know. Does corosync still use lots of CPU time in this
>> situation (without RT) or does it behave normally?
> 
> I'd expect the high load to come from some kind of busy-waiting (hidden
> behind whatever complexity) on something that doesn't happen
> because it is not scheduled. So under these other scheduler
> conditions I would at most expect a short spike until the scheduler
> kicks in.

It was behaving normally.

Best regards,
--Edwin


