[ClusterLabs] [Pacemaker] large cluster - failure recovery
Cédric Dufour - Idiap Research Institute
cedric.dufour at idiap.ch
Thu Nov 19 08:43:36 UTC 2015
[coming over from the old mailing list pacemaker at oss.clusterlabs.org; sorry for any thread discrepancy]
Hello,
We've also set up a fairly large cluster - 24 nodes / 348 resources (Pacemaker 1.1.12, Corosync 1.4.7) - and Pacemaker 1.1.12 is definitely the minimum version you'll want, thanks to changes in how the CIB is handled.
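For what it's worth, a quick way to double-check which versions you are actually running, using the standard command-line tools shipped with Pacemaker/Corosync:

    # Print the installed Pacemaker and Corosync versions
    pacemakerd --version
    corosync -v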
If you're also going to handle a large number of resources (several hundred), you may need to pay attention to the size of the CIB.
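A rough way to get a feel for the CIB size is simply to dump the live CIB and measure it (cibadmin -Q queries the running CIB as XML):

    # Dump the live CIB and measure its uncompressed size in bytes
    cibadmin -Q | wc -c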
You may want to have a look at pp.17-18 of the document I wrote to describe our setup: http://cedric.dufour.name/cv/download/idiap_havc2.pdf
Currently, I would consider that with 24 nodes / 348 resources we are close to the limit of what our cluster can handle, the bottleneck being CPU (core) power for CIB/CRM handling. Our "worst-performing nodes" (out of the 24 in the cluster) are Xeon E7-2830 @ 2.13GHz.
The main issue we currently face is when the DC is taken out and a new one must be elected: CPU usage goes to 100% for several tens of seconds (even minutes), during which the cluster is totally unresponsive. Fortunately, the resources themselves just sit tight and remain available (I can't say about those that would need to be migrated because they are colocated with the DC; we manually avoid that situation when performing maintenance that may affect the DC).
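For what it's worth, before any such maintenance we check which node is currently the DC so we can plan around it; a minimal example with the standard tools (the grep pattern matches the status header printed by crm_mon):

    # Identify the current Designated Controller (DC)...
    crm_mon -1 | grep "Current DC"
    # ...and the name of the local node, to compare against it
    crm_node -n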
I'm looking forward to migrating to Corosync 2+ (there are some backports available for Debian/Jessie) to see if this would allow us to push the limit further. Unfortunately, I can't say for sure, as I have only a limited understanding of how Pacemaker/Corosync work and where the CPU is bound to become a bottleneck.
[UPDATE] Thanks Ken for the Pacemaker Remote pointer; I'm off to have a look at that.
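For anyone following the same pointer, here is a minimal sketch of how a remote node gets integrated from the cluster side, using the ocf:pacemaker:remote resource agent (crm shell syntax; the host name 'remote1' is just an example, assuming pacemaker_remoted is running there):

    # Sketch only: integrate a hypothetical remote node 'remote1' into the cluster
    crm configure primitive remote1 ocf:pacemaker:remote \
        params server=remote1.example.org \
        op monitor interval=30s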
'Hope it can help,
Cédric
On 04/11/15 23:26, Radoslaw Garbacz wrote:
> Thank you, will give it a try.
>
> On Wed, Nov 4, 2015 at 12:50 PM, Trevor Hemsley <themsley at voiceflex.com <mailto:themsley at voiceflex.com>> wrote:
>
> On 04/11/15 18:41, Radoslaw Garbacz wrote:
> > Details:
> > OS: CentOS 6
> > Pacemaker: Pacemaker 1.1.9-1512.el6
> > Corosync: Corosync Cluster Engine, version '2.3.2'
>
> yum update
>
> Pacemaker is currently 1.1.12 and corosync 1.4.7 on CentOS 6. There were
> major improvements in speed with later versions of pacemaker.
>
> Trevor
>
> --
> Best Regards,
>
> Radoslaw Garbacz
> XtremeData Incorporation