[ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?
Ken Gaillot
kgaillot at redhat.com
Wed Feb 20 10:57:15 EST 2019
On Wed, 2019-02-20 at 14:03 +0000, Edwin Török wrote:
>
> On 20/02/2019 12:44, Jan Pokorný wrote:
> > On 19/02/19 16:41 +0000, Edwin Török wrote:
> > > Also noticed this: [ 5390.361861] crmd[12620]: segfault at 0 ip
> > > 00007f221c5e03b1 sp 00007ffcf9cf9d88 error 4 in
> > > libc-2.17.so[7f221c554000+1c2000] [ 5390.361918] Code: b8 00 00
> > > 00 04 00 00 00 74 07 48 8d 05 f8 f2 0d 00 c3 0f 1f 80 00 00 00 00
> > > 48 31 c0 89 f9 83 e1 3f 66 0f ef c0 83 f9 30 77 19 <f3> 0f 6f 0f
> > > 66 0f 74 c1 66 0f d7 d0 85 d2 75 7a 48 89 f8 48 83 e0
> >
> > By any chance, is this an unmodified pacemaker package as
> > obtainable from some public repo together with debug symbols?
>
> I haven't modified pacemaker, here are the versions:
>
> rpm -q pacemaker
> pacemaker-1.1.19-8.el7.x86_64
> rpm -q glibc
> glibc-2.17-260.el7_6.3.x86_64
>
> 0x00007f221c5e03b1 - 0x7f221c554000 = 0x8c3b1
> addr2line -fie /lib64/libc.so.6 0x8c3b1
> __GI_strlen
> :?
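For anyone repeating the address arithmetic above, here is a minimal sketch of it. The three constants are copied from the kernel segfault line quoted in this thread; the helper name `offset_in_mapping` is made up for illustration. It assumes the address printed before the `+` in the kernel line is the load base of the library mapping, so the difference can be passed straight to addr2line.

```python
# Values copied from the kernel segfault line quoted in this thread.
fault_ip = 0x00007f221c5e03b1   # "ip" field: faulting instruction pointer
map_base = 0x7f221c554000       # libc-2.17.so[<base>+<size>]: mapping base
map_size = 0x1c2000             # libc-2.17.so[<base>+<size>]: mapping size


def offset_in_mapping(ip, base, size):
    """Return ip's offset inside the mapping; raise if ip lies outside it.

    Hypothetical helper, not part of any tool mentioned in the thread.
    """
    offset = ip - base
    if not 0 <= offset < size:
        raise ValueError("address is outside the mapping")
    return offset


offset = offset_in_mapping(fault_ip, map_base, map_size)
print(hex(offset))  # value to hand to: addr2line -fie /lib64/libc.so.6 <offset>
```

The range check is there only to catch a mismatched ip/mapping pair, for example when the values come from two different log lines.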
>
> Feb 19 16:22:04 host-10 crmd[12620]: notice: Additional logging
> available in /var/log/cluster/corosync.log
> Feb 19 16:22:05 host-10 crmd[12620]: notice: Connecting to cluster
> infrastructure: corosync
> Feb 19 16:29:50 host-10 crmd[12620]: error: Could not join the CPG
> group 'crmd': 6
> Feb 19 16:29:50 host-10 kernel: crmd[12620]: segfault at 0 ip
> 00007f221c5e03b1 sp 00007ffcf9cf9d88 error 4 in
> libc-2.17.so[7f221c554000+1c2000]
> Feb 19 16:38:28 host-10 pacemakerd[12614]: error: Managed process
> 12620 (crmd) dumped core
> Feb 19 16:38:28 host-10 pacemakerd[12614]: error: The crmd process
> (12620) terminated with signal 11 (core=1)
>
> I found a core file in /var/lib/pacemaker/cores
> (gdb) bt
> #0 0x00007f221c5e03b1 in __strlen_sse2 () from /lib64/libc.so.6
> #1 0x00007f221c5e00be in strdup () from /lib64/libc.so.6
> #2 0x00007f221f1a05cd in election_init (name=name@entry=0x0,
> uname=0x0, period_ms=period_ms@entry=60000,
> cb=cb@entry=0x55ea42cb2790 <election_timeout_popped>)
> at election.c:78
The current code asserts that uname is non-NULL, so this segfault in
strdup() won't happen, but of course a failed assertion is still a crash.
> #3 0x000055ea42cb3d4c in do_ha_control (action=4, cause=<optimized
> out>, cur_state=<optimized out>, current_input=<optimized out>,
> msg_data=0x55ea4464fec0)
> at control.c:139
> #4 0x000055ea42cb0524 in s_crmd_fsa_actions
> (fsa_data=fsa_data@entry=0x55ea4464fec0) at fsa.c:305
> #5 0x000055ea42cb216a in s_crmd_fsa (cause=cause@entry=C_STARTUP) at
> fsa.c:237
> #6 0x000055ea42cad707 in crmd_init () at main.c:173
> #7 0x000055ea42cad510 in main (argc=1, argv=0x7ffcf9cfa078) at
> main.c:122
>
> Best regards,
> --Edwin