[Pacemaker] Adding 100 Resources Locks Cluster for Several Minutes

Gruen, Wolfgang wgruen at idirect.net
Sun Jan 29 19:41:15 EST 2012


*** Adding 100 Resources Locks Cluster for Several Minutes
Adding 100 resources to the cluster causes the cib process to jump to 100% CPU (as seen with "top" on all nodes), and the cluster becomes unresponsive to commands such as "crm status" or "cibadmin -Q" for several minutes. The resources were added with:
cibadmin -R --scope resources -x rsrc100.xml
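
For context, rsrc100.xml contains 100 ocf:idirect:ppct primitives. A minimal sketch of how such a file can be generated is below; the resource names and agent match the crm status output, but the operation ids and monitor interval/timeout values here are assumptions, not the original configuration:

```shell
# Hypothetical generator for rsrc100.xml: 100 ocf:idirect:ppct primitives.
# Resource names and the agent come from the crm status output; the monitor
# interval/timeout values are assumed for illustration.
{
  echo '<resources>'
  for i in $(seq 1 100); do
    cat <<EOF
  <primitive id="pcs_resource_${i}" class="ocf" provider="idirect" type="ppct">
    <operations>
      <op id="pcs_resource_${i}-monitor" name="monitor" interval="30s" timeout="60s"/>
    </operations>
  </primitive>
EOF
  done
  echo '</resources>'
} > rsrc100.xml
```

The resulting file is then loaded in one shot with the cibadmin -R command above.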
The following listing shows that all of the resources were allocated to node 11 (no other node received any, even though all 15 were online), and that roughly 10 minutes after they were added, every resource logged a monitor timeout error:

[root@pcs_linuxha_11 ~]# crm status
============
Last updated: Fri Jan 27 19:21:12 2012
Last change: Fri Jan 27 19:14:35 2012 via cibadmin on pcs_linuxha_1
Stack: openais
Current DC: pcs_linuxha_1 - partition with quorum
Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
15 Nodes configured, 15 expected votes
100 Resources configured.
============

Online: [ pcs_linuxha_1 pcs_linuxha_2 pcs_linuxha_3 pcs_linuxha_4 pcs_linuxha_5 pcs_linuxha_6 pcs_linuxha_7 pcs_linuxha_8 pcs_linuxha_9 pcs_linuxha_10 pcs_linuxha_11 pcs_linuxha_12 pcs_linuxha_13 pcs_linuxha_14 pcs_linuxha_15 ]

 pcs_resource_1 (ocf::idirect:ppct):   Started pcs_linuxha_11
 pcs_resource_2 (ocf::idirect:ppct):   Started pcs_linuxha_11
...
 pcs_resource_100      (ocf::idirect:ppct):   Started pcs_linuxha_11

Failed actions:
    pcs_resource_1_monitor_0 (node=pcs_linuxha_11, call=-1, rc=1, status=Timed Out): unknown error
    pcs_resource_2_monitor_0 (node=pcs_linuxha_11, call=-1, rc=1, status=Timed Out): unknown error
...
    pcs_resource_100_monitor_0 (node=pcs_linuxha_11, call=-1, rc=1, status=Timed Out): unknown error
[root@pcs_linuxha_11 ~]#



Update: Adding a further 300 resources drove the cib process to 100% CPU utilization for approximately 17 minutes and caused the designated controller (DC) to move from node 1 to node 5. At the 17-minute mark, "crm status" again reported many errors, although this time the load was spread across the cluster rather than all landing on node 11 as with the first 100 resources.


