[ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

Wed Feb 20 10:57:15 EST 2019

On Wed, 2019-02-20 at 14:03 +0000, Edwin Török wrote:
> 
> On 20/02/2019 12:44, Jan Pokorný wrote:
> > On 19/02/19 16:41 +0000, Edwin Török wrote:
> > > Also noticed this: [ 5390.361861] crmd[12620]: segfault at 0 ip
> > > 00007f221c5e03b1 sp 00007ffcf9cf9d88 error 4 in
> > > libc-2.17.so[7f221c554000+1c2000] [ 5390.361918] Code: b8 00 00
> > > 00 04 00 00 00 74 07 48 8d 05 f8 f2 0d 00 c3 0f 1f 80 00 00 00 00
> > > 48 31 c0 89 f9 83 e1 3f 66 0f ef c0 83 f9 30 77 19 <f3> 0f 6f 0f
> > > 66 0f 74 c1 66 0f d7 d0 85 d2 75 7a 48 89 f8 48 83 e0
> > 
> > By any chance, is this an unmodified pacemaker package as
> > obtainable from some public repo together with debug symbols?
> 
> I haven't modified pacemaker, here are the versions:
> 
> rpm -q pacemaker
> pacemaker-1.1.19-8.el7.x86_64
> rpm -q glibc
> glibc-2.17-260.el7_6.3.x86_64
> 
> 0x00007f221c5e03b1 - 0x7f221c554000 = 0x8c3b1
> addr2line -fie /lib64/libc.so.6 0x8c3b1
> __GI_strlen
> :?
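
For anyone repeating the offset calculation above, here is a minimal C sketch. It assumes the base and length come from the kernel's `libc-2.17.so[7f221c554000+1c2000]` line; `mapping_contains` and `mapping_offset` are names made up for illustration and are not part of any existing tool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if ip lies inside the mapping [base, base+len). */
bool mapping_contains(uint64_t ip, uint64_t base, uint64_t len)
{
    return ip >= base && ip - base < len;
}

/* Hypothetical helper: offset of ip within the mapping, i.e. the value
 * one would pass to addr2line for a shared library. */
uint64_t mapping_offset(uint64_t ip, uint64_t base, uint64_t len)
{
    assert(mapping_contains(ip, base, len));
    return ip - base;
}
```

With the values from the log, `mapping_offset(0x7f221c5e03b1, 0x7f221c554000, 0x1c2000)` gives 0x8c3b1, the same offset handed to addr2line above.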
> 
> Feb 19 16:22:04 host-10 crmd[12620]:  notice: Additional logging
> available in /var/log/cluster/corosync.log
> Feb 19 16:22:05 host-10 crmd[12620]:  notice: Connecting to cluster
> infrastructure: corosync
> Feb 19 16:29:50 host-10 crmd[12620]:   error: Could not join the CPG
> group 'crmd': 6
> Feb 19 16:29:50 host-10 kernel: crmd[12620]: segfault at 0 ip
> 00007f221c5e03b1 sp 00007ffcf9cf9d88 error 4 in
> libc-2.17.so[7f221c554000+1c2000]
> Feb 19 16:38:28 host-10 pacemakerd[12614]:   error: Managed process
> 12620 (crmd) dumped core
> Feb 19 16:38:28 host-10 pacemakerd[12614]:   error: The crmd process
> (12620) terminated with signal 11 (core=1)
> 
> I found a core file in /var/lib/pacemaker/cores
> (gdb) bt
> #0  0x00007f221c5e03b1 in __strlen_sse2 () from /lib64/libc.so.6
> #1  0x00007f221c5e00be in strdup () from /lib64/libc.so.6
> #2  0x00007f221f1a05cd in election_init (name=name@entry=0x0,
> uname=0x0, period_ms=period_ms@entry=60000,
> cb=cb@entry=0x55ea42cb2790 <election_timeout_popped>)
>     at election.c:78

The current code asserts that uname is non-NULL, so this particular
segfault in strdup() won't happen, but of course a failed assertion is
still a crash.
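
To make the failure mode concrete: the frame above shows uname=0x0 reaching strdup(), and frame #0 is strlen() faulting at address 0. Below is a rough sketch of a guard that reports the problem to the caller instead of crashing. It is not pacemaker's actual election.c; `demo_election`, `demo_election_init` and `demo_election_free` are hypothetical names, and returning NULL on bad input (rather than asserting) is an assumption made only for illustration.

```c
#define _POSIX_C_SOURCE 200809L  /* for strdup() under strict -std= modes */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in, not pacemaker's real election type. */
struct demo_election {
    char *name;
    char *uname;
};

void demo_election_free(struct demo_election *e)
{
    if (e != NULL) {
        free(e->name);
        free(e->uname);
        free(e);
    }
}

/* Returns NULL instead of crashing when a required string is missing.
 * Passing NULL to strdup() is undefined behavior; in the backtrace it
 * ends up in strlen() dereferencing address 0. */
struct demo_election *demo_election_init(const char *name, const char *uname)
{
    struct demo_election *e;

    if (name == NULL || uname == NULL) {
        return NULL;  /* caller decides how to report the error */
    }
    e = calloc(1, sizeof(*e));
    if (e == NULL) {
        return NULL;
    }
    e->name = strdup(name);
    e->uname = strdup(uname);
    if (e->name == NULL || e->uname == NULL) {
        demo_election_free(e);
        return NULL;
    }
    /* Invariant on success: both strings are set. */
    assert(e->name != NULL && e->uname != NULL);
    return e;
}
```

Either way, the caller still has to cope with not having a node name after the CPG join failed; the guard only changes how the problem surfaces.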

> #3  0x000055ea42cb3d4c in do_ha_control (action=4, cause=<optimized
> out>, cur_state=<optimized out>, current_input=<optimized out>,
> msg_data=0x55ea4464fec0)
>     at control.c:139
> #4  0x000055ea42cb0524 in s_crmd_fsa_actions
> (fsa_data=fsa_data@entry=0x55ea4464fec0) at fsa.c:305
> #5  0x000055ea42cb216a in s_crmd_fsa (cause=cause@entry=C_STARTUP) at
> fsa.c:237
> #6  0x000055ea42cad707 in crmd_init () at main.c:173
> #7  0x000055ea42cad510 in main (argc=1, argv=0x7ffcf9cfa078) at
> main.c:122
> 
> Best regards,
> --Edwin