<div dir="ltr"><div><div><div><div><div><div><div>Hi,<br><br></div>Thank you for the advice. Indeed, it seems that Pacemaker Remote will solve my large-cluster problem.<br><br></div>With regard to your questions about my current solution: I scale the corosync parameters based on the number of nodes and also modify some of the kernel network parameters. My tests led me to a set of corosync settings that work but are possibly not the best (the cluster is quite slow to react to some quorum-related events).<br><br></div>The problem seems to be related only to cluster start; once the cluster is running, operations such as node loss/reconnect and agent creation/start/stop work well. As for hardware, memory and network seem to be important.<br><br></div>Below are the settings I used for my latest test (the largest working cluster I tried):<br></div><div>* latest pacemaker/corosync<br></div>* 55 c3.4xlarge nodes (Amazon cloud)<br></div><div>* 55 active nodes, 552 resources in the cluster<br></div>* kernel settings:<br>net.core.wmem_max = 12582912<br>net.core.rmem_max = 12582912<br>net.ipv4.tcp_rmem = 10240 87380 12582912<br>net.ipv4.tcp_wmem = 10240 87380 12582912<br>net.ipv4.tcp_window_scaling = 1<br>net.ipv4.tcp_timestamps = 1<br>net.ipv4.tcp_sack = 1<br>net.ipv4.tcp_no_metrics_save = 1<br>net.core.netdev_max_backlog = 5000<br><br></div>* corosync settings:<br><div><div><div><div><div><div><div>token: 12000<br>consensus: 16000<br>join: 1500<br>send_join: 80<br>merge: 2000<br>downcheck: 2000<br>max_network_delay: 150 # for azure<br><br></div><div>Best regards,<br></div><div><br></div></div></div></div></div></div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Aug 23, 2016 at 12:00 PM, Ken Gaillot <span dir="ltr">&lt;<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On 08/23/2016 11:46 AM, Klaus Wenninger wrote:<br>

> On 08/23/2016 06:26 PM, Radoslaw Garbacz wrote:<br>

>> Hi,<br>

>><br>

>> I would like to ask for settings (and hardware requirements) to have<br>

>> corosync/pacemaker running on about 100 nodes cluster.<br>

> Actually I had thought that 16 would be the limit for full<br>

> pacemaker-cluster-nodes.<br>

> For larger deployments pacemaker-remote should be the way to go. Were<br>

> you speaking of a cluster with remote-nodes?<br>

><br>

> Regards,<br>

> Klaus<br>

>><br>

>> For now some nodes get totally frozen (high CPU, high network usage),<br>

>> so that even login is not possible. By manipulating<br>

>> corosync/pacemaker/kernel parameters I managed to run it on ~40 nodes<br>

>> cluster, but I am not sure which parameters are critical, how to make<br>

>> it more responsive and how to make the number of nodes even bigger.<br>

<br>

</span>16 is a practical limit without special hardware and tuning, so that's<br>

often what companies that offer support for clusters will accept.<br>

<br>

I know people have gone well higher than 16 with a lot of optimization,<br>

but I think somewhere between 32 and 64 corosync can't keep up with the<br>

messages. Your 40 nodes sounds about right. I'd be curious to hear what<br>

you had to do (with hardware, OS tuning, and corosync tuning) to get<br>

that far.<br>

<br>

As Klaus mentioned, Pacemaker Remote is the preferred way to go beyond<br>

that currently:<br>

<br>

<a href="http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Remote/index.html" rel="noreferrer" target="_blank">http://clusterlabs.org/doc/en-<wbr>US/Pacemaker/1.1-pcs/html-<wbr>single/Pacemaker_Remote/index.<wbr>html</a><br>

<div class="HOEnZb"><div class="h5"><br>

>> Thanks,<br>

>><br>

>> --<br>

>> Best Regards,<br>

>><br>

>> Radoslaw Garbacz<br>

>> XtremeData Incorporation<br>

<br>

______________________________<wbr>_________________<br>

Users mailing list: <a href="mailto:Users@clusterlabs.org">Users@clusterlabs.org</a><br>

<a href="http://clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">http://clusterlabs.org/<wbr>mailman/listinfo/users</a><br>

<br>

Project Home: <a href="http://www.clusterlabs.org" rel="noreferrer" target="_blank">http://www.clusterlabs.org</a><br>

Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" rel="noreferrer" target="_blank">http://www.clusterlabs.org/<wbr>doc/Cluster_from_Scratch.pdf</a><br>

Bugs: <a href="http://bugs.clusterlabs.org" rel="noreferrer" target="_blank">http://bugs.clusterlabs.org</a><br>

</div></div></blockquote></div><br><br clear="all"><br>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div>Best Regards,<br><br>Radoslaw Garbacz<br></div>XtremeData Incorporation<br></div></div>

</div>
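<div>For reference, the corosync values listed at the top of this message are options of the totem section in corosync.conf. Below is a minimal sketch of that fragment. It assumes the rest of the section (version, transport, interfaces) is already configured and leaves it out; the numbers are the ones from the 55-node test, not general recommendations.<br>
<pre>
totem {
    # rest of the totem section omitted (assumed to exist already)
    # timeouts are in milliseconds
    token: 12000
    consensus: 16000
    join: 1500
    send_join: 80
    merge: 2000
    downcheck: 2000
    # note from the original settings: for azure
    max_network_delay: 150
}
</pre>
The comment is kept on its own line rather than after the value, as a precaution in case trailing comments are read as part of the value.<br></div>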