[ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

Christine Caulfield ccaulfie at redhat.com
Fri Feb 15 16:08:16 UTC 2019


On 15/02/2019 13:06, Edwin Török wrote:
> 
> 
> On 15/02/2019 11:12, Christine Caulfield wrote:
>> On 15/02/2019 10:56, Edwin Török wrote:
>>> On 15/02/2019 09:31, Christine Caulfield wrote:
>>>> On 14/02/2019 17:33, Edwin Török wrote:
>>>>> Hello,
>>>>>
>>>>> We were testing corosync 2.4.3/libqb 1.0.1-6/sbd 1.3.1/gfs2 on 4.19 and
>>>>> noticed a fundamental problem with realtime priorities:
>>>>> - corosync runs on CPU3, and interrupts for the NIC used by corosync are
>>>>> also routed to CPU3
>>>>> - corosync runs with SCHED_RR, ksoftirqd does not (should it?), but
>>>>> without ksoftirqd running, packets sent/received on that interface do
>>>>> not get processed
>>>>> - corosync is in a busy loop using 100% CPU, never giving a chance for
>>>>> softirqs to be processed (including TIMER and SCHED)
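>>>>>
>>>>> To make the layout concrete, it can be checked roughly like this (a
>>>>> sketch only; "eth0" and the <irq> number are placeholders for whatever
>>>>> the cluster interface actually uses):
>>>>>
>>>>> # which CPU the NIC's interrupts are routed to
>>>>> grep eth0 /proc/interrupts
>>>>> cat /proc/irq/<irq>/smp_affinity_list
>>>>> # scheduling class (cls) and rt priority of corosync vs ksoftirqd
>>>>> ps -eo pid,comm,cls,rtprio | grep -E 'corosync|ksoftirqd'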
>>>>>
>>>>
>>>>
>>>> Can you tell me what distribution this is please? 
>>> This is a not-yet-released development version of XenServer based on
>>> CentOS 7.5/7.6.
>>> The kernel is 4.19.19 + patches to make it work well with Xen
>>> (previously we were using a 4.4.52 + Xen patches and backports kernel)
>>>
>>> The versions of packages are:
>>> rpm -q libqb corosync dlm sbd kernel
>>> libqb-1.0.1-6.el7.x86_64
>>> corosync-2.4.3-13.xs+2.0.0.x86_64
>>> dlm-4.0.7-1.el7.x86_64
>>> sbd-1.3.1-7.xs+2.0.0.x86_64
>>> kernel-4.19.19-5.0.0.x86_64
>>>
>>> Packages with xs in the version have XenServer-specific patches
>>> applied; libqb comes straight from upstream CentOS here:
>>> https://git.centos.org/tree/rpms!libqb.git/fe522aa5e0af26c0cff1170b6d766b5f248778d2
>>>
>>>> There are patches to
>>>> libqb that should be applied to fix a similar problem in 1.0.1-6 - but
>>>> that's a RHEL version and kernel 4.19 is not a RHEL 7 kernel, so I just
>>>> need to be sure that those fixes are in your libqb before going any
>>>> further.
>>>
>>> We have libqb 1.0.1-6 from CentOS. It looks like there is a 1.0.1-7
>>> which includes an SHM crash fix; is this the one you were referring to,
>>> or is there an additional patch elsewhere?
>>> https://git.centos.org/commit/rpms!libqb.git/b5ede72cb0faf5b70ddd504822552fe97bfbbb5e
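>>>
>>> (For checking which fixes are actually in the installed package, the
>>> packaged changelog is probably the quickest thing to look at, e.g.:
>>>
>>> rpm -q --changelog libqb | head -n 20
>>>
>>> though that only shows what the packagers recorded.)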
>>>
>>
>> Thanks. libqb-1.0.1-6 does have the patch I was thinking of - I mainly
>> wanted to check that it wasn't someone else's package that was missing
>> that patch. The SHM patch in -7 fixes a race at shutdown (often seen with
>> sbd). That shouldn't be a problem because there is a workaround in -6
>> anyway, and it's not fixing a spin, which is what we have here of course.
>>
>> Are there any messages in the system logs from either corosync or
>> related subsystems?
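>>
>> For example, something along these lines (paths as on stock CentOS,
>> adjust as needed):
>>
>> journalctl -u corosync -b
>> grep -iE 'corosync|TOTEM|sbd|dlm' /var/log/messages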
> 
> 
> I tried again with 'debug: trace'; there are lots of process pause
> messages here:
> https://clbin.com/ZUHpd
> 
> And here is an strace taken at realtime prio 99, showing a LOT of
> epoll_wait and sendmsg calls (gz format):
> https://clbin.com/JINiV
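> 
> For anyone wanting to reproduce the capture: strace itself was run at a
> realtime priority so it could keep up with the spinning corosync, roughly
> along these lines (the extra strace options are only an illustration):
> 
> chrt -r 99 strace -f -tt -p $(pidof corosync) -o corosync.strace
> gzip corosync.strace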
> 
> It detects large numbers of members having left, but I think this is
> because corosync on those hosts got similarly stuck:
> Feb 15 12:51:07 localhost corosync[29278]:  [TOTEM ] A new membership
> (10.62.161.158:3152) was formed. Members left: 2 14 3 9 5 11 4 12 8 13 7
> 1 10
> Feb 15 12:51:07 localhost corosync[29278]:  [TOTEM ] Failed to receive
> the leave message. failed: 2 14 3 9 5 11 4 12 8 13 7 1 10
> 
> Looking at another host where corosync is still stuck at 100%, it says:
> https://clbin.com/6UOn6
> 

Thanks, that's really quite odd. I have vague recollections of a problem
where corosync was spinning on epoll without reading anything, but I can't
find the details at the moment, which is annoying.

Some things you might be able to try that might help:

1) Is it possible to run without sbd? Sometimes too much polling from
clients can cause odd behaviour.
2) Is it possible to try with a different kernel? We've tried a vanilla
4.19 and it's fine, though obviously not with the Xen patches.
3) Does running corosync with the -p option help? (one way to try this is
sketched just below)
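
For 3), assuming the stock CentOS packaging where the systemd unit picks up
COROSYNC_OPTIONS from /etc/sysconfig/corosync, something like:

echo 'COROSYNC_OPTIONS="-p"' >> /etc/sysconfig/corosync
systemctl restart corosync

or just run "corosync -f -p" in the foreground on one node to see whether
the spin still happens.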

Is there any situation where this has worked, either with different
components or different corosync.conf files?

Also, though I don't think this is directly related to the issue, I can
see configuration reloads happening from 2 nodes every 5 seconds. It's
very odd and maybe not what you want!
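
If you want to chase that down, a crude check on each node would be
something like:

journalctl -u corosync | grep -i reload

and then look for whatever is driving the reload that often (a cron job,
a monitoring agent, or some tool wrapping corosync-cfgtool).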

Chrissie

