[ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

Edwin Török edvin.torok at citrix.com
Fri Feb 15 13:06:16 UTC 2019



On 15/02/2019 11:12, Christine Caulfield wrote:
> On 15/02/2019 10:56, Edwin Török wrote:
>> On 15/02/2019 09:31, Christine Caulfield wrote:
>>> On 14/02/2019 17:33, Edwin Török wrote:
>>>> Hello,
>>>>
>>>> We were testing corosync 2.4.3/libqb 1.0.1-6/sbd 1.3.1/gfs2 on 4.19 and
>>>> noticed a fundamental problem with realtime priorities:
>>>> - corosync runs on CPU3, and interrupts for the NIC used by corosync
>>>> are also routed to CPU3
>>>> - corosync runs with SCHED_RR; ksoftirqd does not (should it?), but
>>>> without ksoftirqd running, packets sent/received on that interface
>>>> do not get processed
>>>> - corosync is in a busy loop using 100% CPU, never giving softirqs
>>>> (including TIMER and SCHED) a chance to run (see the sketch below
>>>> the quoted text)
>>>>
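For reference, a minimal shell sketch of the setup described above
(hypothetical commands, not our actual provisioning scripts; the IRQ
number is a placeholder):

    # Pin corosync to CPU3 at SCHED_RR priority 99 -- corosync normally
    # raises its own priority; it is shown explicitly here for clarity.
    taskset -c 3 chrt --rr 99 corosync -f &
    # Route the NIC's interrupts to CPU3 as well (mask 0x8 = CPU3);
    # "42" is a placeholder IRQ -- look up the real one in /proc/interrupts.
    echo 8 > /proc/irq/42/smp_affinity
    # ksoftirqd/3 stays SCHED_OTHER, so once corosync spins at RT
    # priority on CPU3, the softirqs it depends on never get to run:
    chrt -p "$(pgrep -x 'ksoftirqd/3')"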
>>>
>>>
>>> Can you tell me what distribution this is, please?
>> This is a not-yet-released development version of XenServer based on
>> CentOS 7.5/7.6.
>> The kernel is 4.19.19 + patches to make it work well with Xen
>> (previously we were using a 4.4.52 + Xen patches and backports kernel)
>>
>> The versions of packages are:
>> rpm -q libqb corosync dlm sbd kernel
>> libqb-1.0.1-6.el7.x86_64
>> corosync-2.4.3-13.xs+2.0.0.x86_64
>> dlm-4.0.7-1.el7.x86_64
>> sbd-1.3.1-7.xs+2.0.0.x86_64
>> kernel-4.19.19-5.0.0.x86_64
>>
>> Packages with +xs in the version have XenServer-specific patches
>> applied; libqb comes straight from upstream CentOS here:
>> https://git.centos.org/tree/rpms!libqb.git/fe522aa5e0af26c0cff1170b6d766b5f248778d2
>>
>>> There are patches to
>>> libqb that should be applied to fix a similar problem in 1.0.1-6 - but
>>> that's a RHEL version and kernel 4.19 is not a RHEL 7 kernel, so I just
>>> need to be sure that those fixes are in your libqb before going any
>>> further.
>>
>> We have libqb 1.0.1-6 from CentOS; it looks like there is a 1.0.1-7
>> which includes an SHM crash fix. Is this the one you were referring
>> to, or is there an additional patch elsewhere?
>> https://git.centos.org/commit/rpms!libqb.git/b5ede72cb0faf5b70ddd504822552fe97bfbbb5e
>>
> 
> Thanks. libqb-1.0.1-6 does have the patch I was thinking of - I mainly
> wanted to check it wasn't someone else's package that was missing that
> patch. The SHM patch in -7 fixes a race at shutdown (often seen with
> sbd). That shouldn't be a problem because there is a workaround in -6
> anyway, and it's not fixing a spin, which is what we have here, of course.
> 
> Are there any messages in the system logs from either corosync or
> related subsystems?


I tried again with 'debug: trace'; there are lots of process pause
messages here:
https://clbin.com/ZUHpd
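For reference, trace logging was enabled with a stanza along these
lines in /etc/corosync/corosync.conf (an assumption -- only the debug
knob is shown, the rest of the file is unchanged):

    logging {
            debug: trace
    }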

And here is an strace of corosync running at realtime priority 99, with
a LOT of epoll_wait and sendmsg calls (gz format):
https://clbin.com/JINiV
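A capture along these lines can be used (hypothetical command; <pid>
stands for corosync's PID -- strace itself has to run at an equal or
higher RT priority, or the spinning SCHED_RR process starves it too):

    chrt --rr 99 strace -tt -p <pid> 2>&1 | gzip > corosync.strace.gz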

It detects a large number of members leaving, but I think this is
because corosync on those hosts got similarly stuck:
Feb 15 12:51:07 localhost corosync[29278]:  [TOTEM ] A new membership
(10.62.161.158:3152) was formed. Members left: 2 14 3 9 5 11 4 12 8 13 7
1 10
Feb 15 12:51:07 localhost corosync[29278]:  [TOTEM ] Failed to receive
the leave message. failed: 2 14 3 9 5 11 4 12 8 13 7 1 10
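To see the membership from a node's own point of view, the standard
corosync 2.x tools can be used (a sketch, assuming default tooling):

    corosync-cfgtool -s      # ring/interface status of the local node
    corosync-quorumtool -s   # quorum state and current member list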

Looking at another host where corosync is still stuck at 100% CPU, it says:
https://clbin.com/6UOn6

Feb 15 13:01:56 localhost corosync[30153]:  [MAIN  ] Totem is unable to
form a cluster because of an operating system or network fault (reason:
totem is continuously in gather state). The most common cause of this
message is that the local firewall is configured improperly.
Feb 15 13:01:58 localhost corosync[30153]:  [MAIN  ] Totem is unable to
form a cluster because of an operating system or network fault (reason:
totem is continuously in gather state). The most common cause of this
message is that the local firewall is configured improperly.
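To rule out the firewall cause named in the message, the totem ports
can be checked; totem defaults to UDP 5405 (mcastport) and 5404
(mcastport - 1) unless corosync.conf overrides them (a sketch, assuming
CentOS 7 with firewalld and iptables):

    firewall-cmd --list-all                 # allowed ports/services
    iptables -L -n -v | grep -E '540(4|5)'  # explicit rules, if any
    ss -uapn | grep corosync                # UDP sockets corosync has bound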

Hope this helps,
--Edwin

