[ClusterLabs] pcsd processes using 100% CPU

Wed May 23 18:43:04 UTC 2018

Okay, I have this happening again on a couple of servers right now, and am happy to let it spin and dig into it further.  I'm not at all experienced with this kind of debugging, though, so I will need explicit instructions on what to do beyond what I've documented here...

I don't see anything of note in the pcsd.log - it seems to be just normal activity logged by the master process, which is not the runaway one.  Here's a snippet:

10.124.167.177 - - [23/May/2018:15:56:34 +0000] "GET /remote/get_configs HTTP/1.1" 200 553 0.0145
10.124.167.177 - - [23/May/2018:15:56:34 +0000] "GET /remote/get_configs HTTP/1.1" 200 553 0.0147
10.124.167.177 - - [23/May/2018:15:56:34 UTC] "GET /remote/get_configs HTTP/1.1" 200 553
- -> /remote/get_configs
I, [2018-05-23T15:56:37.972682 #1378]  INFO -- : Running: /usr/sbin/corosync-cmapctl totem.cluster_name
I, [2018-05-23T15:56:37.972805 #1378]  INFO -- : CIB USER: hacluster, groups: 
I, [2018-05-23T15:56:37.982066 #1378]  INFO -- : Return Value: 0
10.124.167.176 - - [23/May/2018:15:56:37 +0000] "GET /remote/get_configs HTTP/1.1" 200 553 0.0107
10.124.167.176 - - [23/May/2018:15:56:37 +0000] "GET /remote/get_configs HTTP/1.1" 200 553 0.0108
10.124.167.176 - - [23/May/2018:15:56:37 UTC] "GET /remote/get_configs HTTP/1.1" 200 553
- -> /remote/get_configs
I, [2018-05-23T15:57:10.648134 #1378]  INFO -- : Running: /usr/sbin/corosync-cmapctl totem.cluster_name
I, [2018-05-23T15:57:10.648276 #1378]  INFO -- : CIB USER: hacluster, groups: 
I, [2018-05-23T15:57:10.660617 #1378]  INFO -- : Return Value: 0
10.124.167.178 - - [23/May/2018:15:57:10 +0000] "GET /remote/get_configs HTTP/1.1" 200 553 0.0140
10.124.167.178 - - [23/May/2018:15:57:10 +0000] "GET /remote/get_configs HTTP/1.1" 200 553 0.0141
10.124.167.178 - - [23/May/2018:15:57:10 UTC] "GET /remote/get_configs HTTP/1.1" 200 553
- -> /remote/get_configs

I ran `strace -p <pid>` against the runaway child, and the screen filled with the following line repeating as fast as my terminal could render:
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0

I redirected this into a file for about 1 second and it filled with about 20,000 of those lines.
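
In case it helps anyone repeat that measurement, here is a rough sketch of a one-second capture and a tally of the syscalls seen.  The file name `trace.txt` and both function names are placeholders of my own, not anything provided by pcsd or strace, and it assumes `timeout` (coreutils) and strace's `-o` option are available:

```shell
#!/bin/sh
# Hypothetical helpers; the function names and file paths are placeholders.

# Attach strace to a PID for roughly one second and save the trace.
# $1 = PID of the spinning process, $2 = output file.
capture_one_second() {
    timeout 1 strace -p "$1" -o "$2"
}

# Tally syscall names in a saved trace, most frequent first.
# Prints lines of the form "<count> <syscall>".
count_syscalls() {
    awk -F'[(]' '/^[a-z_0-9]+[(]/ { n[$1]++ }
                 END { for (k in n) print n[k], k }' "$1" | sort -rn
}
```

Usage would be something like `capture_one_second 24923 trace.txt` followed by `count_syscalls trace.txt`; for a trace like the one above, the output should be a single line giving the count next to `sched_yield`.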

I installed ltrace, but don't really know how to use it...

`ltrace -p <pid>` didn't output anything.

`ltrace -p <pid> -S` showed something similar to strace:

SYS_sched_yield(0x7f0ebc3f5c40, 0x7f0ebc3f5c40, 0, 0x7273752f3a6e6962)                                                       = 0
SYS_sched_yield(0x7f0ebc3f5c40, 0x7f0ebc3f5c40, 0, 0x7273752f3a6e6962)                                                       = 0
SYS_sched_yield(0x7f0ebc3f5c40, 0x7f0ebc3f5c40, 0, 0x7273752f3a6e6962)                                                       = 0

I next enabled debugging in /etc/default/pcsd and issued a `systemctl restart pcsd`.  Unfortunately, that killed the runaway child process.

However, I found another server where the same thing is happening.  Debugging is not enabled there, but is there anything else I can do while the process is still running?

Here are the pcsd processes:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      6103  0.0  0.3 1076744 59972 ?       Ssl  Apr06  67:17 /usr/bin/ruby -C/var/lib/pcsd -I/usr/share/pcsd -- /usr/share/pcsd/ssl.rb & > /dev/null &
root     24923 99.8  0.3 1076744 52744 ?       Rl   May19 5556:31  \_ /usr/bin/ruby -C/var/lib/pcsd -I/usr/share/pcsd -- /usr/share/pcsd/ssl.rb & > /dev/null &

I don't have gcore installed and don't know which package provides it.  I also have no experience with gdb, but am happy to try anything suggested to help figure out what's going on.

The pcs version is 0.9.149, as packaged by Debian and inherited by Ubuntu.

Regards,
-- 
Casey