[ClusterLabs] pcsd 99% CPU
Scott Greenlese
swgreenl at us.ibm.com
Mon Feb 6 16:38:49 EST 2017
Jan and Tomas,
Even though the 99% busy condition I reported on Friday no longer seems to
be present, I went ahead and generated a core dump, as Jan suggested in his
earlier reply:
"If you still have the pcsd/ruby process in that state, it might be
worth dumping a core for further off-line examination. Assuming you
have enough space to store it (in order of gigabytes, it seems) and
gdb installed, you can do it like: gcore -o pcsd.core 23277
I have no idea how far the support for Ruby interpretation in gdb
goes (Python is quite well supported in terms of high level
debugging), but could be enough for figuring out what's going on.
If you are confident enough your cluster configuration does not
contain anything too confidential, it would perhaps be best if
you shared this core file in a compressed form privately with
tojeline at redhat. Otherwise, you can use gdb itself to look
around the call stack in the core file, strings utility to guess
if there's excessive accumulation of particular strings, and similar
analyses, some of which are applicable also on live process, and
some would be usable only on live process (like strace)."
There is clearly still a very high occurrence of "High CIB load detected"
in the last 12 hours (a quick per-hour breakdown is sketched after the
excerpt below) ...
[root@zs95kj cluster]# grep "High CIB load detected" /var/log/messages |wc -l
3940
[root@zs95kj cluster]# date;grep "High CIB load detected" /var/log/messages |tail
Mon Feb 6 12:10:30 EST 2017
Feb 6 12:06:00 zs95kj crmd[27233]: notice: High CIB load detected: 0.993333
Feb 6 12:06:30 zs95kj crmd[27233]: notice: High CIB load detected: 0.989333
Feb 6 12:07:00 zs95kj crmd[27233]: notice: High CIB load detected: 0.994667
Feb 6 12:07:30 zs95kj crmd[27233]: notice: High CIB load detected: 0.994000
Feb 6 12:08:00 zs95kj crmd[27233]: notice: High CIB load detected: 0.993333
Feb 6 12:08:30 zs95kj crmd[27233]: notice: High CIB load detected: 0.994333
Feb 6 12:09:00 zs95kj crmd[27233]: notice: High CIB load detected: 0.997333
Feb 6 12:09:30 zs95kj crmd[27233]: notice: High CIB load detected: 0.993000
Feb 6 12:10:00 zs95kj crmd[27233]: notice: High CIB load detected: 0.990000
Feb 6 12:10:30 zs95kj crmd[27233]: notice: High CIB load detected: 0.992333
But I don't currently see 99% CPU in the top output for the ruby process.
The cib process, however, does hit 99%:
top - 16:15:15 up 5 days, 6:35, 2 users, load average: 3.05, 2.81, 2.76
Tasks: 1185 total, 3 running, 1182 sleeping, 0 stopped, 0 zombie
%Cpu(s): 16.0 us, 5.0 sy, 0.0 ni, 78.4 id, 0.0 wa, 0.1 hi, 0.4 si, 0.1 st
KiB Mem : 65795900 total, 3073336 free, 49808324 used, 12914240 buff/cache
KiB Swap: 10485657+total, 99452000 free, 5404572 used. 13466516 avail Mem

  PID USER     PR NI    VIRT    RES    SHR S  %CPU %MEM      TIME+ COMMAND
27225 haclust+ 20  0  110060  71608  19412 R  99.7  0.1    6339:35 cib
23277 root     20  0 14.151g 9.626g 123696 S  19.8 15.3  938:17.53 ruby
Also, commands such as 'pcs status pcsd' are returning very quickly now (as
opposed to the 4-5 minute duration I reported earlier).
Anyway, I created the core on the cluster node showing the highest amount
of accumulated run time on the pcsd / ruby process (PID 23277):
[root@zs95kj pcsd_hang_zeus]# date;ps -ef |grep pcsd
Mon Feb 6 09:08:19 EST 2017
root 23253     1  0 Feb01 ? 00:00:00 /bin/sh /usr/lib/pcsd/pcsd start
root 23276 23253  0 Feb01 ? 00:00:00 /bin/bash -c ulimit -S -c 0 >/dev/null 2>&1 ; /usr/bin/ruby -I/usr/lib/pcsd /usr/lib/pcsd/ssl.rb
root 23277 23276 12 Feb01 ? 15:10:10 /usr/bin/ruby -I/usr/lib/pcsd /usr/lib/pcsd/ssl.rb
I don't know if it matters, but I had to run gcore against process 23277
twice, because I ran out of space on the first attempt; I ended up deleting
that first core.
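In hindsight, a quick space check first would have avoided that. The core is
roughly the size of the process's virtual address space, so something like
this gives a rough upper bound to compare against the free space on the
target filesystem:

# VIRT of the target process, in KiB -- approximates the eventual core size
ps -o vsz= -p 23277
# free space where the core will land
df -h /guestimages/nfs1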
To get the second one:
[root@zs95kj pcsd_hang_zeus]# date;gcore -o /guestimages/nfs1/pcsd.core2 23277
Mon Feb 6 09:09:51 EST 2017
[New LWP 43430]
[New LWP 24440]
[New LWP 23403]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x000003ffa91f42d4 in ?? () from /lib64/libc.so.6
warning: target file /proc/23277/cmdline contained unexpected null characters
Saved corefile /guestimages/nfs1/pcsd.core2.23277
[root@zs95kj pcsd_hang_zeus]#
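For what it's worth, this is roughly how the core can be examined offline
along the lines Jan described (a sketch; plain gdb only shows the
interpreter's C-level frames, not Ruby-level ones):

# open the core against the interpreter binary and inspect the threads
gdb /usr/bin/ruby /guestimages/nfs1/pcsd.core2.23277
(gdb) info threads
(gdb) thread apply all bt

# check for excessive accumulation of particular strings
strings /guestimages/nfs1/pcsd.core2.23277 | sort | uniq -c | sort -rn | head -20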
I compressed it and uploaded it to ibm.box.com, along with the recent pcsd
and messages logs from my node.
I will send the box link to Tomas in a separate email thread.
For the heck of it, I also got a core dump of the cib process, and uploaded
that too:
[root@zs95kj ~]# date;gcore -o /guestimages/nfs1/zs95kj_cib_core_Feb6 27225
Mon Feb 6 16:21:03 EST 2017
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x000003ffa8a885aa in BZ2_compressBlock () from /lib64/libbz2.so.1
Saved corefile /guestimages/nfs1/zs95kj_cib_core_Feb6.27225
In the pcsd log, I see "pcs status nodes corosync" requests roughly once
every 64 seconds (a quick way to compute the interval is sketched after the
excerpt below).
[root@zs95kj cluster]# grep "pcs status nodes corosync" /var/log/pcsd/pcsd.log |wc -l
1861
[root@zs95kj ~]# grep "pcs status nodes corosync" /var/log/pcsd/pcsd.log |tail
I, [2017-02-06T16:18:53.468885 #23277] INFO -- : Running: /usr/sbin/pcs status nodes corosync
I, [2017-02-06T16:19:57.139796 #23277] INFO -- : Running: /usr/sbin/pcs status nodes corosync
I, [2017-02-06T16:21:00.833517 #23277] INFO -- : Running: /usr/sbin/pcs status nodes corosync
I, [2017-02-06T16:22:04.151613 #23277] INFO -- : Running: /usr/sbin/pcs status nodes corosync
I, [2017-02-06T16:23:07.803426 #23277] INFO -- : Running: /usr/sbin/pcs status nodes corosync
I, [2017-02-06T16:24:11.360107 #23277] INFO -- : Running: /usr/sbin/pcs status nodes corosync
I, [2017-02-06T16:25:14.828617 #23277] INFO -- : Running: /usr/sbin/pcs status nodes corosync
I, [2017-02-06T16:26:18.553637 #23277] INFO -- : Running: /usr/sbin/pcs status nodes corosync
I, [2017-02-06T16:27:22.363723 #23277] INFO -- : Running: /usr/sbin/pcs status nodes corosync
I, [2017-02-06T16:28:26.330005 #23277] INFO -- : Running: /usr/sbin/pcs status nodes corosync
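The ~64 second figure came from eyeballing the timestamps above; a sketch
that computes the gaps properly, assuming GNU date and the pcsd.log
timestamp format shown:

grep "pcs status nodes corosync" /var/log/pcsd/pcsd.log \
  | sed 's/.*\[\(....-..-..T..:..:..\).*/\1/' \
  | while read t; do date -d "$t" +%s; done \
  | awk 'NR > 1 { print $1 - prev } { prev = $1 }'

Each output line is the number of seconds between two consecutive requests.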
A typical, repeating set of entries in pcsd.log looks like this:
I, [2017-02-06T16:29:26.367339 #23277] INFO -- : Running: /usr/sbin/corosync-cmapctl totem.cluster_name
I, [2017-02-06T16:29:26.367458 #23277] INFO -- : CIB USER: hacluster, groups:
I, [2017-02-06T16:29:27.438925 #23277] INFO -- : Return Value: 0
::ffff:10.20.93.13 - - [06/Feb/2017:16:29:27 -0500] "GET /remote/get_configs HTTP/1.1" 200 1077 1.0768
::ffff:10.20.93.13 - - [06/Feb/2017:16:29:27 -0500] "GET /remote/get_configs HTTP/1.1" 200 1077 1.0770
zs93KLpcs1.pokprv.stglabs.ibm.com - - [06/Feb/2017:16:29:26 EST] "GET /remote/get_configs HTTP/1.1" 200 1077
- -> /remote/get_configs
I, [2017-02-06T16:29:28.816100 #23277] INFO -- : Running: /usr/sbin/corosync-cmapctl totem.cluster_name
I, [2017-02-06T16:29:28.816288 #23277] INFO -- : CIB USER: hacluster, groups:
I, [2017-02-06T16:29:29.890788 #23277] INFO -- : Return Value: 0
I, [2017-02-06T16:29:29.890958 #23277] INFO -- : Running: /usr/sbin/pcs status nodes corosync
I, [2017-02-06T16:29:29.890990 #23277] INFO -- : CIB USER: hacluster, groups:
I, [2017-02-06T16:29:31.082885 #23277] INFO -- : Return Value: 0
I, [2017-02-06T16:29:31.084041 #23277] INFO -- : SRWT Node: zs90kppcs1 Request: get_configs
I, [2017-02-06T16:29:31.085012 #23277] INFO -- : SRWT Node: zs93kjpcs1 Request: get_configs
I, [2017-02-06T16:29:31.086098 #23277] INFO -- : SRWT Node: zs93KLpcs1 Request: get_configs
I, [2017-02-06T16:29:31.088295 #23277] INFO -- : SRWT Node: zs95kjpcs1 Request: get_configs
I, [2017-02-06T16:29:31.088515 #23277] INFO -- : SRWT Node: zs95KLpcs1 Request: get_configs
I, [2017-02-06T16:29:31.157357 #23277] INFO -- : Running: /usr/sbin/corosync-cmapctl totem.cluster_name
I, [2017-02-06T16:29:31.157480 #23277] INFO -- : CIB USER: hacluster, groups:
I, [2017-02-06T16:29:32.271072 #23277] INFO -- : Return Value: 0
::ffff:10.20.93.12 - - [06/Feb/2017:16:29:32 -0500] "GET /remote/get_configs HTTP/1.1" 200 1077 1.1190
::ffff:10.20.93.12 - - [06/Feb/2017:16:29:32 -0500] "GET /remote/get_configs HTTP/1.1" 200 1077 1.1193
zs95kjpcs1.pokprv.stglabs.ibm.com - - [06/Feb/2017:16:29:31 EST] "GET /remote/get_configs HTTP/1.1" 200 1077
- -> /remote/get_configs
Again, the pcsd query commands are returning quickly now, so all this data
may no longer be of much use.
Comments / suggestions welcome. Thanks...
Scott Greenlese ... IBM KVM on System Z Solutions Test, Poughkeepsie, N.Y.
INTERNET: swgreenl at us.ibm.com
From: Tomas Jelinek <tojeline at redhat.com>
To: users at clusterlabs.org
Date: 02/06/2017 04:55 AM
Subject: Re: [ClusterLabs] pcsd 99% CPU
On 3.2.2017 22:08, Scott Greenlese wrote:
> Hi all..
>
> Over the past few days, I noticed that the pcsd ruby process is pegged
> at 99% CPU, and commands such as
> pcs status pcsd take up to 5 minutes to complete. On all active cluster
> nodes, top shows:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 27225 haclust+ 20 0 116324 91600 23136 R 99.3 0.1 1943:40 cib
> 23277 root 20 0 12.868g 8.176g 8460 S 99.7 13.0 407:44.18 ruby
>
> The system log indicates High CIB load detected over the past 2 days:
>
> [root@zs95kj ~]# grep "High CIB load detected" /var/log/messages |grep "Feb 3" |wc -l
> 1655
> [root@zs95kj ~]# grep "High CIB load detected" /var/log/messages |grep "Feb 2" |wc -l
> 1658
> [root@zs95kj ~]# grep "High CIB load detected" /var/log/messages |grep "Feb 1" |wc -l
> 147
> [root@zs95kj ~]# grep "High CIB load detected" /var/log/messages |grep "Jan 31" |wc -l
> 444
> [root@zs95kj ~]# grep "High CIB load detected" /var/log/messages |grep "Jan 30" |wc -l
> 352
>
>
> The first entries logged on Feb 2 started around 8:42am ...
>
> Feb 2 08:42:12 zs95kj crmd[27233]: notice: High CIB load detected: 0.974333
>
> This happens to coincide with the time that I had caused a node fence
> (off) action by creating an iface-bridge resource and specifying
> a non-existent vlan slave interface (reported to the group yesterday in
> a separate email thread). It also happened to cause me to lose
> quorum in the cluster, because 2 of my 5 cluster nodes were already
> offline.
>
> My cluster currently has just over 200 VirtualDomain resources to
> manage, plus one iface-bridge resource and one iface-vlan resource,
> both of which are currently configured properly and operational.
>
> I would appreciate some guidance how to proceed with debugging this
> issue. I have not taken any recovery actions yet.
Checking /var/log/pcsd/pcsd.log to see what pcsd is actually doing might
be a good start. What pcsd version do you have?
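A quick way to check, assuming the RHEL packaging where the pcsd daemon
ships as part of the pcs package:

rpm -q pcs
pcs --version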
> I considered stopping the cluster, recycling pcsd.service on all nodes,
> and restarting the cluster... and also rebooting the nodes, if
> necessary. But I didn't want to clear it yet in case there's anything I
> can capture while in this state.
Restarting just pcsd might be enough.
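For example, assuming pcsd runs as a systemd service as it does on RHEL 7
(this touches only the pcsd daemon, not corosync or pacemaker):

# run on each cluster node
systemctl restart pcsd.service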
Tomas
>
> Thanks..
>
> Scott Greenlese ... KVM on System Z - Solutions Test, Poughkeepsie, N.Y.
> INTERNET: swgreenl at us.ibm.com
>
>
_______________________________________________
Users mailing list: Users at clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org