<div dir="ltr">Thank you Ken and Digimer for all your suggestions.<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Nov 4, 2015 at 2:32 PM, Ken Gaillot <span dir="ltr">&lt;<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On 11/04/2015 12:55 PM, Digimer wrote:<br>

> On 04/11/15 01:50 PM, Radoslaw Garbacz wrote:<br>

>> Hi,<br>

>><br>

>> I have a cluster of 32 nodes, and after some tuning was able to have it<br>

>> started and running,<br>

><br>

> This is not supported by RH for a reason; it's hard to get the timing<br>

> right. SUSE supports up to 32 nodes, but they must be doing some serious<br>

> magic behind the scenes.<br>

><br>

> I would *strongly* recommend dividing this up into a few smaller<br>

> clusters... 8 nodes per cluster would be max I'd feel comfortable with.<br>

> You need your cluster to solve more problems than it causes...<br>

<br>

</span>Hi Radoslaw,<br>

<br>

RH supports up to 16. 32 should be possible with recent<br>

pacemaker+corosync versions and careful tuning, but it's definitely<br>

leading-edge.<br>

<br>

An alternative with pacemaker 1.1.10+ (1.1.12+ recommended) is Pacemaker<br>

Remote, which easily scales to dozens of nodes:<br>

<a href="http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Remote/index.html" rel="noreferrer" target="_blank">http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Remote/index.html</a><br>

<br>

Pacemaker Remote is a really good approach once you start pushing the<br>

limits of cluster nodes. Probably better than trying to get corosync to<br>

handle more nodes. (There are long-term plans for improving corosync's<br>

scalability, but that doesn't help you now.)<br>

<span class=""><br>

>> but it does not recover from a node disconnect-connect failure.<br>

>> It regains quorum, but CIB does not recover to a synchronized state and<br>

>> "cibadmin -Q" times out.<br>

>><br>

>> Is there anything with corosync or pacemaker parameters I can do to make<br>

>> it recover from such a situation<br>

>> (everything works for smaller clusters).<br>

>><br>

>> In my case it is OK for a node to disconnect (all the major resources<br>

>> are shut down)<br>

>> and later reconnect to the cluster (the running monitoring agent will<br>

>> clean up and restart major resources if needed),<br>

>> so I do not have STONITH configured.<br>

>><br>

>> Details:<br>

>> OS: CentOS 6<br>

>> Pacemaker: Pacemaker 1.1.9-1512.el6<br>

><br>

> Upgrade.<br>

<br>

</span>If you can upgrade to the latest CentOS 6.7, you can get a much newer<br>

Pacemaker. But Pacemaker is probably not limiting your cluster nodes;<br>

the newer version's main benefit would be Pacemaker Remote support. (Of<br>

course there are plenty of bug fixes and new features as well.)<br>

<span class=""><br>

>> Corosync: Corosync Cluster Engine, version '2.3.2'<br>

><br>

> This is not supported on EL6 at all. Please stick with corosync 1.4 and<br>

> use the cman plugin as the quorum provider.<br>

<br>

</span>CentOS is self-supported anyway, so if you're willing to handle your own<br>

upgrades and such, nothing wrong with compiling. But corosync is up to<br>

2.3.5 so you're already behind. :) I'd recommend compiling libqb 0.17.2<br>

if you're compiling recent corosync and/or pacemaker.<br>

<br>

Alternatively, CentOS 7 will have recent versions of everything.<br>

<div class="HOEnZb"><div class="h5"><br>

>> Corosync configuration:<br>

>>         token: 10000<br>

>>         #token_retransmits_before_loss_const: 10<br>

>>         consensus: 15000<br>

>>         join: 1000<br>

>>         send_join: 80<br>

>>         merge: 1000<br>

>>         downcheck: 2000<br>

>>         #rrp_problem_count_timeout: 5000<br>

>>         max_network_delay: 150 # for azure<br>
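A quick sanity check on the totem timings quoted above, as a minimal sketch only.<br>
It assumes the rule from the corosync.conf man page that consensus must be at least 1.2 * token;<br>
the helper names (parse_totem, consensus_ok) are made up for illustration and are not part of corosync.<br>
<pre>
```python
# Hypothetical helpers (not part of corosync): parse "key: value" totem lines
# and check the documented rule that consensus must be at least 1.2 * token.
TOTEM_SNIPPET = """
token: 10000
#token_retransmits_before_loss_const: 10
consensus: 15000
join: 1000
send_join: 80
merge: 1000
downcheck: 2000
#rrp_problem_count_timeout: 5000
max_network_delay: 150 # for azure
"""


def parse_totem(text):
    """Return {option: milliseconds} for every uncommented 'key: value' line."""
    values = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop full-line and trailing comments
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        values[key.strip()] = int(value.strip())
    return values


def consensus_ok(values):
    """True when consensus is at least 1.2 * token (both in milliseconds)."""
    return values["consensus"] >= 1.2 * values["token"]


totem = parse_totem(TOTEM_SNIPPET)
print(totem["token"], totem["consensus"], consensus_ok(totem))
```
</pre>
With the values posted here (token 10000, consensus 15000) the check passes, so this particular relation is probably not the problem.<br>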

>><br>

>><br>

>> Some logs:<br>

>><br>

>> [...]<br>

>> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:<br>

>> cib_process_diff:         Diff 1.9254.1 -> 1.9255.1 from local not<br>

>> applied to 1.9275.1: current "epoch" is greater than required<br>

>> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:<br>

>> update_cib_cache_cb:      [cib_diff_notify] Patch aborted: Application<br>

>> of an update diff failed (-1006)<br>

>> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:<br>

>> cib_process_diff:         Diff 1.9255.1 -> 1.9256.1 from local not<br>

>> applied to 1.9275.1: current "epoch" is greater than required<br>

>> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:<br>

>> update_cib_cache_cb:      [cib_diff_notify] Patch aborted: Application<br>

>> of an update diff failed (-1006)<br>

>> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:<br>

>> cib_process_diff:         Diff 1.9256.1 -> 1.9257.1 from local not<br>

>> applied to 1.9275.1: current "epoch" is greater than required<br>

>> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:<br>

>> update_cib_cache_cb:      [cib_diff_notify] Patch aborted: Application<br>

>> of an update diff failed (-1006)<br>

>> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:<br>

>> cib_process_diff:         Diff 1.9257.1 -> 1.9258.1 from local not<br>

>> applied to 1.9275.1: current "epoch" is greater than required<br>

>> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:<br>

>> update_cib_cache_cb:      [cib_diff_notify] Patch aborted: Application<br>

>> of an update diff failed (-1006)<br>

>> [...]<br>
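To see how far behind the rejected diffs are, the version triples in these messages can be compared directly.<br>
This is a rough sketch, assuming the three numbers are the CIB admin_epoch.epoch.num_updates and that they compare left to right;<br>
the two sample lines are re-joined from the wrapped log excerpt above, and the function names are only illustrative.<br>
<pre>
```python
import re

# Two entries re-joined from the wrapped log excerpt above.
LOG_LINES = [
    'cib_process_diff: Diff 1.9254.1 -> 1.9255.1 from local not applied '
    'to 1.9275.1: current "epoch" is greater than required',
    'cib_process_diff: Diff 1.9257.1 -> 1.9258.1 from local not applied '
    'to 1.9275.1: current "epoch" is greater than required',
]

# Captures: diff source version, diff target version, version currently held.
DIFF_RE = re.compile(
    r"Diff (\d+\.\d+\.\d+) -> (\d+\.\d+\.\d+) from \S+ not applied to (\d+\.\d+\.\d+)"
)


def version(text):
    """Turn '1.9254.1' into (1, 9254, 1); assumed order is
    (admin_epoch, epoch, num_updates)."""
    return tuple(int(part) for part in text.split("."))


def stale_diffs(lines):
    """Return (target, current, epoch_lag) for every diff whose target
    version is older than the version the receiver already holds."""
    stale = []
    for line in lines:
        match = DIFF_RE.search(line)
        if match is None:
            continue
        _source, target, current = (version(g) for g in match.groups())
        if current > target:  # tuples compare element by element
            stale.append((target, current, current[1] - target[1]))
    return stale


for target, current, lag in stale_diffs(LOG_LINES):
    print("diff to", target, "is", lag, "epochs behind", current)
```
</pre>
For the lines quoted here the receiver is already at epoch 9275 while the diffs target 9255 to 9258, so it is being fed updates roughly 20 epochs old.<br>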

>><br>

>> [...]<br>

>> Nov 04 17:43:24 [12176] ip-10-109-145-175    crm_mon:    error:<br>

>> cib_native_perform_op_delegate:         Couldn't perform cib_query<br>

>> operation (timeout=120s): Operation already in progress (-114)<br>

>> Nov 04 17:43:24 [12176] ip-10-109-145-175    crm_mon:    error:<br>

>> get_cib_copy:   Couldnt retrieve the CIB<br>

>> Nov 04 17:43:24 [12176] ip-10-109-145-175    crm_mon:    error:<br>

>> cib_native_perform_op_delegate:         Couldn't perform cib_query<br>

>> operation (timeout=120s): Operation already in progress (-114)<br>

>> Nov 04 17:43:24 [12176] ip-10-109-145-175    crm_mon:    error:<br>

>> get_cib_copy:   Couldnt retrieve the CIB<br>

>> Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice  [QUORUM]<br>

>> Members[32]: 3 27 11 29 23 21 24 9 17 12 32 13 2 10 16 15 6 28 19 1 22 26 5\<br>

>> Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice  [QUORUM]<br>

>> Members[32]: 14 20 31 30 8 25 18 7 4<br>

>> Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice  [MAIN  ]<br>

>> Completed service synchronization, ready to provide service.<br>

>> Nov 04 18:06:55 [10599] ip-10-109-145-175 corosync notice  [QUORUM]<br>

>> Members[32]: 3 27 11 29 23 21 24 9 17 12 32 13 2 10 16 15 6 28 19 1 22 26 5\<br>

>> Nov 04 18:06:55 [10599] ip-10-109-145-175 corosync notice  [QUORUM]<br>

>> Members[32]: 14 20 31 30 8 25 18 7 4<br>

>> [...]<br>
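The Members[32] list is split over two log entries, so it is easy to misread.<br>
A small sketch that merges the two lines and confirms all 32 node IDs are present; the regular expression and the helper name are my own, not anything corosync provides.<br>
<pre>
```python
import re

# The two [QUORUM] entries from the log excerpt above; the first ends with
# the backslash continuation marker that corosync printed.
LOG_LINES = [
    "corosync notice  [QUORUM] Members[32]: 3 27 11 29 23 21 24 9 17 12 32 "
    "13 2 10 16 15 6 28 19 1 22 26 5\\",
    "corosync notice  [QUORUM] Members[32]: 14 20 31 30 8 25 18 7 4",
]

# Captures the announced member count and the run of node IDs after it.
MEMBERS_RE = re.compile(r"Members\[(\d+)\]:\s*([\d\s]+)")


def collect_members(lines):
    """Merge node IDs from consecutive Members[...] lines.

    Returns (announced_count, set_of_node_ids)."""
    announced = None
    node_ids = set()
    for line in lines:
        match = MEMBERS_RE.search(line)
        if match is None:
            continue
        announced = int(match.group(1))
        node_ids.update(int(tok) for tok in match.group(2).split())
    return announced, node_ids


announced, node_ids = collect_members(LOG_LINES)
print(announced, len(node_ids), node_ids == set(range(1, 33)))
```
</pre>
Both entries together list every ID from 1 to 32, which matches the observation that quorum is regained even though the CIB does not resynchronize.<br>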

>><br>

>> [...]<br>

>> Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng:   notice:<br>

>> update_cib_cache_cb:    [cib_diff_notify] Patch aborted: Application of<br>

>> an update diff failed (-1006)<br>

>> Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng:     info:<br>

>> apply_xml_diff:         Digest mis-match: expected<br>

>> 01192e5118739b7c33c23f7645da3f45, calculated<br>

>> f8028c0c98526179ea5df0a2ba0d09de<br>

>> Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng:  warning:<br>

>> cib_process_diff:       Diff 1.15046.2 -> 1.15046.3 from local not<br>

>> applied to 1.15046.2: Failed application of an update diff<br>

>> Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng:   notice:<br>

>> update_cib_cache_cb:    [cib_diff_notify] Patch aborted: Application of<br>

>> an update diff failed (-1006)<br>

>> Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng:   notice:<br>

>> cib_process_diff:       Diff 1.15046.2 -> 1.15046.3 from local not<br>

>> applied to 1.15046.3: current "num_updates" is greater than required<br>

>> [...]<br>

>><br>

>><br>

>> P.S. Sorry if this should have been posted on the corosync newsgroup; only the CIB<br>

>> synchronization fails, so this group seemed to me the right place.<br>

><br>

> All of the HA mailing lists are merging into ClusterLabs. This is the<br>

> right place to ask.<br>

><br>

>> --<br>

>> Best Regards,<br>

>><br>

>> Radoslaw Garbacz<br>

><br>

<br>

<br>

</div></div><div class="HOEnZb"><div class="h5">

</div></div></blockquote></div><br><br clear="all"><br>-- <br><div class="gmail_signature"><div dir="ltr"><div>Best Regards,<br><br>Radoslaw Garbacz<br></div>XtremeData Incorporation<br></div></div>

</div>