[ClusterLabs] [Pacemaker] large cluster - failure recovery

Radoslaw Garbacz radoslaw.garbacz at xtremedatainc.com
Thu Nov 19 15:32:21 UTC 2015


Thank you.

Indeed, the latest corosync and pacemaker do work with large clusters -
some tuning is required though.
By working I also mean recovering after a node is lost and regained, which
was the major issue before: corosync worked (it re-established membership),
but pacemaker was not able to sync the CIB. It still needs some time and
CPU power to do so, though.

It works for me on a 34-node cluster with a few hundred resources (I
haven't tested anything bigger yet).
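
The tuning I mean is mostly the corosync totem timeouts, which need to be
relaxed for a ring of this size. A sketch of the kind of settings involved
(the values below are illustrative only, not a recommendation - tune them
for your own environment):

    # /etc/corosync/corosync.conf (excerpt)
    totem {
        version: 2
        # give the token more time to circle a large ring
        token: 10000
        # must remain larger than token (default is 1.2 * token)
        consensus: 12000
        # slightly longer join timeout for membership formation
        join: 100
        token_retransmits_before_loss_const: 10
    }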



On Thu, Nov 19, 2015 at 2:43 AM, Cédric Dufour - Idiap Research Institute <
cedric.dufour at idiap.ch> wrote:

> [coming over from the old mailing list pacemaker at oss.clusterlabs.org;
> sorry for any thread discrepancy]
>
> Hello,
>
> We've also set up a fairly large cluster - 24 nodes / 348 resources
> (pacemaker 1.1.12, corosync 1.4.7) - and pacemaker 1.1.12 is definitely the
> minimum version you'll want, thanks to changes in how the CIB is handled.
>
> If you're going to handle a large number (several hundred) of resources
> as well, you may need to concern yourself with the CIB size.
> You may want to have a look at pp.17-18 of the document I wrote to
> describe our setup: http://cedric.dufour.name/cv/download/idiap_havc2.pdf
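>
> A quick way to gauge where you stand (cibadmin ships with pacemaker; the
> PCMK_ipc_buffer variable lives in /etc/default/pacemaker on Debian,
> /etc/sysconfig/pacemaker on RHEL-likes):
>
>     cibadmin --query | wc -c                     # live CIB size in bytes
>     grep PCMK_ipc_buffer /etc/default/pacemaker  # IPC buffer to compare against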
>
> Currently, I would consider that with 24 nodes / 348 resources, we are
> close to the limit of what our cluster can handle, the bottleneck being
> CPU(core) power for CIB/CRM handling. Our "worst performing nodes" (out of
> the 24 in the cluster) are Xeon E7-2830 @ 2.13GHz.
> The main issue we currently face is when a DC is taken out and a new one
> must be elected: CPU goes to 100% for several tens of seconds (even
> minutes), during which the cluster is totally unresponsive. Fortunately,
> resources themselves just sit tight and remain available (I can't say about
> those that would need to be migrated because they are collocated with the
> DC; we manually avoid that situation when performing maintenance that may
> affect the DC).
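>
> (For what it's worth, that manual avoidance boils down to something like
> the following with the stock pacemaker CLI tools, the node name being a
> placeholder:
>
>     crm_mon -1 | grep 'Current DC'   # find out which node holds the DC role
>     crm_standby -N <dc-node> -v on   # drain its resources before maintenance
>     crm_standby -N <dc-node> -v off  # bring it back afterwards
>
> i.e. we make sure nothing is left collocated with the node before we touch
> it.)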
>
> I'm looking forward to migrating to corosync 2+ (there are some backports
> available for Debian/Jessie) and seeing if this would allow us to push the
> limit further. Unfortunately, I can't say for sure, as I have only a limited
> understanding of how Pacemaker/Corosync work and where the CPU is bound to
> become a bottleneck.
>
> [UPDATE] Thanks, Ken, for the Pacemaker Remote pointer; I'm heading off to
> have a look at that.
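>
> (For anyone else reading along: from the cluster's point of view a remote
> node is just a resource and never joins the corosync membership, so it
> should sidestep exactly the kind of load described above. A minimal sketch,
> with the names as placeholders and pcs as one of several possible
> front-ends:
>
>     pcs resource create rnode1 ocf:pacemaker:remote server=<hostname>
>
> the remote host itself only needs to run the pacemaker_remoted daemon.)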
>
> 'Hope it can help,
>
> Cédric
>
> On 04/11/15 23:26, Radoslaw Garbacz wrote:
>
> Thank you, will give it a try.
>
> On Wed, Nov 4, 2015 at 12:50 PM, Trevor Hemsley <themsley at voiceflex.com>
> wrote:
>
>> On 04/11/15 18:41, Radoslaw Garbacz wrote:
>> > Details:
>> > OS: CentOS 6
>> > Pacemaker: Pacemaker 1.1.9-1512.el6
>> > Corosync: Corosync Cluster Engine, version '2.3.2'
>>
>> yum update
>>
>> Pacemaker is currently 1.1.12 and corosync 1.4.7 on CentOS 6. There were
>> major improvements in speed with later versions of pacemaker.
>>
>> Trevor
>>
>
>
>
> --
> Best Regards,
>
> Radoslaw Garbacz
> XtremeData Incorporation
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>


-- 
Best Regards,

Radoslaw Garbacz
XtremeData Incorporation