<div dir="ltr"><div><div><div>Hi,<br><br></div>I have a cluster of 32 nodes and, after some tuning, was able to get it started and running, <br>but it does not recover from a node disconnect/reconnect failure.<br>It regains quorum, but the CIB does not return to a synchronized state and "cibadmin -Q" times out.<br><br></div>Are there any corosync or pacemaker parameters I can change to make it recover from such a situation?<br>(Everything works for smaller clusters.)<br><br></div>In my case it is OK for a node to disconnect (all the major resources are shut down)<br>and later rejoin the cluster (the running monitoring agent will clean up and restart major resources if needed),<br>so I do not have STONITH configured.<br clear="all"><div><br></div><div>Details:<br></div><div>OS: CentOS 6<br></div><div>Pacemaker: Pacemaker 1.1.9-1512.el6<br></div><div>Corosync: Corosync Cluster Engine, version '2.3.2'<br><br><br></div><div>Corosync configuration:<br> token: 10000<br> #token_retransmits_before_loss_const: 10<br> consensus: 15000<br> join: 1000<br> send_join: 80<br> merge: 1000<br> downcheck: 2000<br> #rrp_problem_count_timeout: 5000<br> max_network_delay: 150 # for azure<br><br><br></div>Some logs:<br><br>[...]<br>Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:
notice: cib_process_diff: Diff 1.9254.1 -> 1.9255.1 from
local not applied to 1.9275.1: current "epoch" is greater than required<br>Nov
04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application
of an update diff failed (-1006)<br>Nov 04 17:50:18 [7985]
ip-10-142-181-98 stonith-ng: notice: cib_process_diff: Diff
1.9255.1 -> 1.9256.1 from local not applied to 1.9275.1: current
"epoch" is greater than required<br>Nov 04 17:50:18 [7985]
ip-10-142-181-98 stonith-ng: notice: update_cib_cache_cb:
[cib_diff_notify] Patch aborted: Application of an update diff failed
(-1006)<br>Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
cib_process_diff: Diff 1.9256.1 -> 1.9257.1 from local not
applied to 1.9275.1: current "epoch" is greater than required<br>Nov 04
17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application
of an update diff failed (-1006)<br>Nov 04 17:50:18 [7985]
ip-10-142-181-98 stonith-ng: notice: cib_process_diff: Diff
1.9257.1 -> 1.9258.1 from local not applied to 1.9275.1: current
"epoch" is greater than required<br>Nov 04 17:50:18 [7985]
ip-10-142-181-98 stonith-ng: notice: update_cib_cache_cb:
[cib_diff_notify] Patch aborted: Application of an update diff failed
(-1006)<br>[...]<br><br>[...]<br>Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error: cib_native_perform_op_delegate: Couldn't perform cib_query operation (timeout=120s): Operation already in progress (-114)<br>Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error: get_cib_copy: Couldnt retrieve the CIB<br>Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error: cib_native_perform_op_delegate: Couldn't perform cib_query operation (timeout=120s): Operation already in progress (-114)<br>Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error: get_cib_copy: Couldnt retrieve the CIB<br>Nov
04 17:47:40 [10599] ip-10-109-145-175 corosync notice [QUORUM]
Members[32]: 3 27 11 29 23 21 24 9 17 12 32 13 2 10 16 15 6 28 19 1 22
26 5\<br>Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice [QUORUM] Members[32]: 14 20 31 30 8 25 18 7 4<br>Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice [MAIN ] Completed service synchronization, ready to provide service.<br>Nov
04 18:06:55 [10599] ip-10-109-145-175 corosync notice [QUORUM]
Members[32]: 3 27 11 29 23 21 24 9 17 12 32 13 2 10 16 15 6 28 19 1 22
26 5\<br>Nov 04 18:06:55 [10599] ip-10-109-145-175 corosync notice [QUORUM] Members[32]: 14 20 31 30 8 25 18 7 4<br>[...]<br><br>[...]<br>Nov
04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of
an update diff failed (-1006)<br>Nov 04 18:21:15 [17749]
ip-10-178-149-131 stonith-ng: info: apply_xml_diff: Digest
mis-match: expected 01192e5118739b7c33c23f7645da3f45, calculated f8028c0c98526179ea5df0a2ba0d09de<br>Nov
04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: warning:
cib_process_diff: Diff 1.15046.2 -> 1.15046.3 from local not
applied to 1.15046.2: Failed application of an update diff<br>Nov 04
18:21:15 [17749] ip-10-178-149-131 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of
an update diff failed (-1006)<br>Nov 04 18:21:15 [17749]
ip-10-178-149-131 stonith-ng: notice: cib_process_diff: Diff
1.15046.2 -> 1.15046.3 from local not applied to 1.15046.3: current
"num_updates" is greater than required<br>[...]<br><br><br>P.S.
Sorry if this should have been posted on the corosync list; only the CIB
synchronization fails, so this group seemed to me the right place.<br clear="all"><br>-- <br><div class="gmail_signature"><div dir="ltr"><div>Best Regards,<br><br>Radoslaw Garbacz<br></div></div></div>
</div>