[Pacemaker] Configuration recommendations for (very?) large cluster

Cédric Dufour - Idiap Research Institute cedric.dufour at idiap.ch
Tue Aug 12 12:02:09 EDT 2014

On 12/08/14 07:52, Andrew Beekhof wrote:
> On 11 Aug 2014, at 10:10 pm, Cédric Dufour - Idiap Research Institute <cedric.dufour at idiap.ch> wrote:
>> Hello,
>> Thanks to Pacemaker 1.1.12, I have been able to setup a (very?) large cluster:
> Thats certainly up there as one of the biggest :)

Well, actually, I sized it down from 444 to 277 resources by merging 'VirtualDomain' and 'MailTo' RA/primitives into a custom/single 'LibvirtQemu' one.
The CIB is now ~3MiB uncompressed / ~100kiB compressed. (This also keeps the informational-only 'MailTo' RA from burdening the cluster.)
'PCMK_ipc_buffer' at 2MiB might be overkill now... but I'd rather stay on the safe side.

Q: Are there adverse effects in keeping 'PCMK_ipc_buffer' high?

277 resources are:
 - 22 (cloned) network-health (ping) resources
 - 88 (cloned) stonith resources (I have 4 stonith devices)
 - 167 LibvirtQemu resources (83 "general-purpose" servers and 84 SGE-driven computation nodes)
(and more LibvirtQemu resources are expected to come)
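
As a quick sanity check of the breakdown above, here is a small Python sketch. It assumes that the 22 ping instances are one clone running on each of the 22 nodes and that each of the 4 stonith devices is cloned on every node; the variable names are only illustrative.

```python
# Figures taken from the list above; the "one clone instance per node"
# reading is an assumption, not something stated explicitly.
NODES = 22
STONITH_DEVICES = 4

ping_instances = 1 * NODES                     # one cloned ping resource
stonith_instances = STONITH_DEVICES * NODES    # 4 cloned stonith devices
libvirt_qemu = 83 + 84                         # general-purpose + SGE nodes

total = ping_instances + stonith_instances + libvirt_qemu
print(ping_instances, stonith_instances, libvirt_qemu, total)
# prints: 22 88 167 277
```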

> Have you checked pacemaker's CPU usage during startup/failover?  I'd be interested in your results.

I finally set 'batch-limit' to 22 (the number of nodes). This makes sense when enabling a new primitive, since its monitor operations get dispatched immediately to all nodes at once.

When bringing a standby node to life:

 - On the "waking" node (E5-2690v2): 167+5 resource monitoring operations get dispatched; the CPU load of the 'cib' process remains below 100% as the operations are executed, batched by 22 (though one cannot actually see that batching, since the monitoring operations succeed very quickly), and complete in ~2 seconds. With Pacemaker 1.1.7, the 'cib' load would have peaked at 100% even before the first monitoring operation started (because of the CIB refresh, I guess) and would have stayed there for several tens of seconds (often resulting in timeouts and failed monitoring operations).

 - On the DC node (E5-2690v2): the CPU load also remains below 100%, alternating between the 'cib', 'pengine' and 'crmd' processes. The DC is back to IDLE within ~4 seconds.

I tried raising the 'batch-limit' to 50 and witnessed CPU load peaking at 100% while carrying out the same procedure, but all went well nonetheless.

While I still had the ~450 resources, I also "accidentally" brought all 22 nodes back to life together (well, actually started the DC alone and then started the remaining 21 nodes together). As could be expected, the DC got quite busy (dispatching/executing the ~450*22 monitoring operations on all nodes). It took 40 minutes for the cluster to stabilize. But it did stabilize, with no timeouts and no monitor operation failures! A few "high CIB load detected / throttle down mode" messages popped up but all went well.
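
To put those figures in perspective, a back-of-the-envelope Python sketch. It assumes the rounded ~450 resources and 40 minutes quoted above, and that every resource is probed once on each of the 22 nodes; the resulting rate is only an order of magnitude, not a measurement.

```python
# Rounded figures from the paragraph above (assumptions, not measurements).
resources = 450
nodes = 22
stabilization_minutes = 40

# One monitor (probe) operation per resource per node.
probes = resources * nodes
probes_per_second = probes / (stabilization_minutes * 60)

print(probes)                        # 9900
print(round(probes_per_second, 1))   # roughly 4 operations per second
```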

Q: Is there a way to favor more powerful nodes for the DC (i.e. to push the DC "election" process in a preferred direction)?

>> Last updated: Mon Aug 11 13:40:14 2014
>> Last change: Mon Aug 11 13:37:55 2014
>> Stack: classic openais (with plugin)
> I would at least try running it with corosync 2.x (no plugin)
> That will use CPG for messaging which should perform even better.

I'm running into a deadline now and will have to stick to 1.4.x for the moment. But as soon as I can free an old test Intel modular chassis I have around, I'll try backporting Coro 2.x from Debian/Experimental to Debian/Wheezy and see what gives.

>> Current DC: bc1hx5a05 - partition with quorum
>> Version: 1.1.12-561c4cf
>> 22 Nodes configured, 22 expected votes
>> 444 Resources configured
>> PS: 'corosync' (1.4.7) traffic goes through a 10GbE network, with strict QoS priority over all other traffic.
>> Are there recommended configuration tweaks I should not miss in such situation?
>> So far, I have:
>> - Raised the 'PCMK_ipc_buffer' size to 2MiB
>> - Lowered the 'batch-limit' to 10 (though I believe my setup could sustain the default 30)
> Yep, definitely worth trying the higher value.
> We _should_ automatically start throttling ourselves if things get too intense.

Yep. As mentioned above, I did see "high CIB load detected / throttle down mode" messages pop up. Is this what you had in mind?

> Other than that, I would be making sure all the corosync.conf timeouts and other settings are appropriate.

I never paid much attention to it so far. But it seems to me the Debian defaults are quite conservative, especially given my 10GbE (~0.2ms latency) interconnect and the care I took in prioritizing Corosync traffic (thanks to switch QoS/GMB and Linux 'tc'):

    token: 3000
    token_retransmits_before_loss_const: 10
    join: 60
    consensus: 3600
    vsftype: none
    max_messages: 20
    secauth: off
    amf: disabled
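
For reference, a small Python sketch of how these timeouts relate to each other. It relies on two rules as I recall them from the corosync.conf(5) man page, so treat both as assumptions to verify: 'consensus' must be at least 1.2 * 'token', and the token retransmit interval is derived as token / (token_retransmits_before_loss_const + 0.2). All times are in milliseconds.

```python
# Values copied from the settings above (milliseconds).
token = 3000
token_retransmits_before_loss_const = 10
consensus = 3600

# Assumed rule: consensus >= 1.2 * token (integer math avoids float rounding).
consensus_ok = consensus * 10 >= token * 12

# Assumed formula for the derived retransmit interval.
token_retransmit = token / (token_retransmits_before_loss_const + 0.2)

print(consensus_ok)              # True
print(round(token_retransmit))   # 294
```

Under these assumptions, 'consensus' sits exactly at its minimum for a 3000 ms token, so lowering 'token' would allow 'consensus' to be lowered proportionally.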

Am I right?

PS: this work is being done within the context of the BEAT European research project - https://www.beat-eu.org/ - which aims, among other things, to "develop an online and open platform to transparently and independently evaluate biometric systems against validated benchmarks". There shall be some "publication" about the infrastructure set up. If interested, I can keep you posted.



>> Thank you in advance for your response.
>> Best,
>> Cédric
>> -- 
>> Cédric Dufour @ Idiap Research Institute
