[ClusterLabs] pcsd 99% CPU

Fri Feb 3 16:08:26 EST 2017

Hi all..

Over the past few days, I have noticed that the pcsd (ruby) process is
pegged at 99% CPU, and commands such as "pcs status pcsd" take up to
5 minutes to complete.  On all active cluster nodes, top shows:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27225 haclust+  20   0  116324  91600  23136 R  99.3  0.1   1943:40 cib
23277 root      20   0 12.868g 8.176g   8460 S  99.7 13.0 407:44.18 ruby

The system log shows "High CIB load detected" messages over the past
several days, with a sharp increase in the last 2 days:

[root@zs95kj ~]# grep "High CIB load detected" /var/log/messages |grep "Feb 3" |wc -l
1655
[root@zs95kj ~]# grep "High CIB load detected" /var/log/messages |grep "Feb 2" |wc -l
1658
[root@zs95kj ~]# grep "High CIB load detected" /var/log/messages |grep "Feb 1" |wc -l
147
[root@zs95kj ~]# grep "High CIB load detected" /var/log/messages |grep "Jan 31" |wc -l
444
[root@zs95kj ~]# grep "High CIB load detected" /var/log/messages |grep "Jan 30" |wc -l
352

The first entries logged on Feb 2 started around 8:42am ...

Feb  2 08:42:12 zs95kj crmd[27233]:  notice: High CIB load detected: 0.974333

This coincides with the time that I caused a node fence (off) action by
creating an iface-bridge resource that specified a non-existent vlan
slave interface (reported to the group yesterday in a separate email
thread).  That also caused the cluster to lose quorum, because 2 of my
5 cluster nodes were already offline.

My cluster currently manages just over 200 VirtualDomain resources,
plus one iface-bridge resource and one iface-vlan resource, both of
which are currently configured properly and operational.

I would appreciate some guidance on how to proceed with debugging this
issue.  I have not taken any recovery actions yet.
I considered stopping the cluster, recycling pcsd.service on all nodes,
and restarting the cluster, and also rebooting the nodes if necessary.
But I didn't want to clear the condition yet, in case there's anything
I can capture while in this state.

Thanks..

Scott Greenlese ... KVM on System Z -  Solutions Test,  Poughkeepsie, N.Y.
  INTERNET:  swgreenl at us.ibm.com