[ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

Wed Feb 20 09:03:03 EST 2019

On 20/02/2019 12:44, Jan Pokorný wrote:
> On 19/02/19 16:41 +0000, Edwin Török wrote:
>> Also noticed this: [ 5390.361861] crmd[12620]: segfault at 0 ip
>> 00007f221c5e03b1 sp 00007ffcf9cf9d88 error 4 in
>> libc-2.17.so[7f221c554000+1c2000] [ 5390.361918] Code: b8 00 00
>> 00 04 00 00 00 74 07 48 8d 05 f8 f2 0d 00 c3 0f 1f 80 00 00 00 00
>> 48 31 c0 89 f9 83 e1 3f 66 0f ef c0 83 f9 30 77 19 <f3> 0f 6f 0f
>> 66 0f 74 c1 66 0f d7 d0 85 d2 75 7a 48 89 f8 48 83 e0
> 
> By any chance, is this an unmodified pacemaker package as
> obtainable from some public repo together with debug symbols?

I haven't modified pacemaker; here are the versions:

rpm -q pacemaker
pacemaker-1.1.19-8.el7.x86_64
rpm -q glibc
glibc-2.17-260.el7_6.3.x86_64

0x00007f221c5e03b1 - 0x7f221c554000 = 0x8c3b1
addr2line -fie /lib64/libc.so.6 0x8c3b1
__GI_strlen
:?
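
As a cross-check of the subtraction above, here is a small Python sketch that pulls the instruction pointer and mapping base out of the kernel segfault line and computes the offset to hand to addr2line. The helper name library_offset and the regular expression are my own illustration, not part of any tool, and they assume the line has the same shape as the one in this log.

```python
import re

# Kernel segfault line copied from the log in this message.
LINE = ("crmd[12620]: segfault at 0 ip 00007f221c5e03b1 sp 00007ffcf9cf9d88 "
        "error 4 in libc-2.17.so[7f221c554000+1c2000]")


def library_offset(line):
    """Return (library, offset) for the faulting instruction pointer.

    Hypothetical helper: assumes the 'ip <hex> ... in <lib>[<base>+<size>]'
    layout seen in the log above.
    """
    m = re.search(r"ip ([0-9a-f]+) .* in (\S+)\[([0-9a-f]+)\+([0-9a-f]+)\]", line)
    if m is None:
        raise ValueError("not a kernel segfault line")
    lib = m.group(2)
    ip = int(m.group(1), 16)
    base = int(m.group(3), 16)
    size = int(m.group(4), 16)
    offset = ip - base
    # The offset is only meaningful if ip falls inside the reported mapping.
    if not 0 <= offset < size:
        raise ValueError("ip lies outside the reported mapping")
    return lib, offset


lib, offset = library_offset(LINE)
print(lib, hex(offset))
```

The resulting offset is the value passed to addr2line above.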

Feb 19 16:22:04 host-10 crmd[12620]:  notice: Additional logging
available in /var/log/cluster/corosync.log
Feb 19 16:22:05 host-10 crmd[12620]:  notice: Connecting to cluster
infrastructure: corosync
Feb 19 16:29:50 host-10 crmd[12620]:   error: Could not join the CPG
group 'crmd': 6
Feb 19 16:29:50 host-10 kernel: crmd[12620]: segfault at 0 ip
00007f221c5e03b1 sp 00007ffcf9cf9d88 error 4 in
libc-2.17.so[7f221c554000+1c2000]
Feb 19 16:38:28 host-10 pacemakerd[12614]:   error: Managed process
12620 (crmd) dumped core
Feb 19 16:38:28 host-10 pacemakerd[12614]:   error: The crmd process
(12620) terminated with signal 11 (core=1)
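
The "error 4" in the kernel line is the x86 page-fault error code. Below is a minimal sketch of decoding its low bits, assuming the standard x86 layout (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode, bit 4: instruction fetch). The helper name decode_page_fault is made up for illustration.

```python
def decode_page_fault(error_code):
    """Decode the low bits of an x86 page-fault error code.

    Hypothetical helper; only the bits listed here are interpreted.
    """
    return {
        # Bit 0: set means protection violation, clear means page not present.
        "cause": "protection violation" if error_code & 0x1 else "page not present",
        # Bit 1: set means the access was a write, clear means a read.
        "access": "write" if error_code & 0x2 else "read",
        # Bit 2: set means the fault happened in user mode.
        "mode": "user" if error_code & 0x4 else "kernel",
        # Bit 4: set means the fault was caused by an instruction fetch.
        "instruction_fetch": bool(error_code & 0x10),
    }


print(decode_page_fault(4))
```

For error 4 this decodes to a user-mode read of a page that is not present, which fits a read through a NULL pointer given the "segfault at 0" address.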

I found a core file in /var/lib/pacemaker/cores; here is its backtrace:
(gdb) bt
#0  0x00007f221c5e03b1 in __strlen_sse2 () from /lib64/libc.so.6
#1  0x00007f221c5e00be in strdup () from /lib64/libc.so.6
#2  0x00007f221f1a05cd in election_init (name=name@entry=0x0,
uname=0x0, period_ms=period_ms@entry=60000, cb=cb@entry=0x55ea42cb2790
<election_timeout_popped>)
    at election.c:78
#3  0x000055ea42cb3d4c in do_ha_control (action=4, cause=<optimized
out>, cur_state=<optimized out>, current_input=<optimized out>,
msg_data=0x55ea4464fec0)
    at control.c:139
#4  0x000055ea42cb0524 in s_crmd_fsa_actions
(fsa_data=fsa_data@entry=0x55ea4464fec0) at fsa.c:305
#5  0x000055ea42cb216a in s_crmd_fsa (cause=cause@entry=C_STARTUP) at
fsa.c:237
#6  0x000055ea42cad707 in crmd_init () at main.c:173
#7  0x000055ea42cad510 in main (argc=1, argv=0x7ffcf9cfa078) at main.c:122


Best regards,
--Edwin