[Pacemaker] Adding 100 Resources Locks Cluster for Several Minutes

Thu Feb 2 16:39:43 EST 2012

On Mon, Jan 30, 2012 at 11:41 AM, Gruen, Wolfgang <wgruen at idirect.net> wrote:
>
>
> *** Adding 100 Resources Locks Cluster for Several Minutes
>
> Adding 100 resources to the cluster causes the cib process to jump to 100%
> when viewed with the "top" command (all nodes), and the cluster becomes
> unresponsive to commands like "crm status" or "cibadmin -Q" for several
> minutes.

The cluster is working as hard as it can to clear the thousands of CIB
updates that result from adding that many resources.

Operations = R*N + 2*R, where R = number of resources and N = number of nodes

For 300 resources and 15 nodes that is 5100 operations. With your
measurement of 17 minutes (1020 seconds), that's about 0.2s per operation,
which isn't /horrible/ given the amount of work involved in each
operation.  No doubt we can do better.
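
As a quick sanity check of that arithmetic, here is a minimal sketch in
Python. The function names are hypothetical (they are not part of
Pacemaker), and the formula is only the rough estimate given above, not a
measured count.

```python
def cib_operations(resources: int, nodes: int) -> int:
    """Rough operation count from the estimate above: R*N + 2*R."""
    return resources * nodes + 2 * resources


def seconds_per_operation(resources: int, nodes: int, elapsed_seconds: float) -> float:
    """Average time per operation for a measured elapsed time."""
    return elapsed_seconds / cib_operations(resources, nodes)


if __name__ == "__main__":
    # Figures from the report: 300 resources, 15 nodes, about 17 minutes.
    ops = cib_operations(300, 15)
    per_op = seconds_per_operation(300, 15, 17 * 60)
    print(f"{ops} operations, {per_op:.2f} s per operation")
```

By the same estimate, the first batch of 100 resources on 15 nodes works
out to 1700 operations.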

Have you tried tuning the batch-limit cluster option, which caps how many
actions the cluster executes in parallel?

Were there any messages from the CIB about failure to apply an update diff?
If so, you might be affected by:
    https://github.com/ClusterLabs/pacemaker/commit/10e9e579ab032bde3938d7f3e13c414e297ba3e9

> cibadmin -R --scope resources -x rsrc100.xml
> The following listing shows that all the resources were allocated to node
> 11, no other nodes received resources even though they were online, and
> every entry listed an error after approximately 10 minutes elapsed from when
> they were added to the cluster.
>
> [root@pcs_linuxha_11 ~]# crm status
> ============
> Last updated: Fri Jan 27 19:21:12 2012
> Last change: Fri Jan 27 19:14:35 2012 via cibadmin on pcs_linuxha_1
> Stack: openais
> Current DC: pcs_linuxha_1 - partition with quorum
> Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
> 15 Nodes configured, 15 expected votes
> 100 Resources configured.
> ============
>
> Online: [ pcs_linuxha_1 pcs_linuxha_2 pcs_linuxha_3 pcs_linuxha_4
> pcs_linuxha_5 pcs_linuxha_6 pcs_linuxha_7 pcs_linuxha_8 pcs_linuxha_9
> pcs_linuxha_10 pcs_linuxha_11 pcs_linuxha_12 pcs_linuxha_13 pcs_linuxha_14
> pcs_linuxha_15 ]
>
>  pcs_resource_1 (ocf::idirect:ppct):   Started pcs_linuxha_11
>  pcs_resource_2 (ocf::idirect:ppct):   Started pcs_linuxha_11
> ...
>  pcs_resource_100      (ocf::idirect:ppct):   Started pcs_linuxha_11
>
> Failed actions:
>     pcs_resource_1_monitor_0 (node=pcs_linuxha_11, call=-1, rc=1, status=Timed Out): unknown error
>     pcs_resource_2_monitor_0 (node=pcs_linuxha_11, call=-1, rc=1, status=Timed Out): unknown error
> ...
>     pcs_resource_100_monitor_0 (node=pcs_linuxha_11, call=-1, rc=1, status=Timed Out): unknown error
> [root@pcs_linuxha_11 ~]#
>
> Update: Adding an additional 300 resources caused the cib process to go to
> 100% CPU utilization for approximately 17 minutes, and caused the designated
> controller (DC) to switch from node 1 to node 5. Many errors appeared in the
> output of crm status at the 17 minute point, although this time the load was
> spread across the cluster instead of all landing on node 11 as with the
> first 100 resources.