[ClusterLabs] large cluster - failure recovery

Wed Nov 4 18:55:14 UTC 2015

On 04/11/15 01:50 PM, Radoslaw Garbacz wrote:
> Hi,
> 
> I have a cluster of 32 nodes, and after some tuning was able to have it
> started and running,

A cluster this size is not supported by RH for a reason; it's hard to
get the timing right. SUSE supports up to 32 nodes, but they must be
doing some serious magic behind the scenes.

I would *strongly* recommend dividing this up into a few smaller
clusters... 8 nodes per cluster would be max I'd feel comfortable with.
You need your cluster to solve more problems than it causes...

> but it does not recover from a node disconnect-connect failure.
> It regains quorum, but CIB does not recover to a synchronized state and
> "cibadmin -Q" times out.
> 
> Is there anything with corosync or pacemaker parameters I can do to make
> it recover from such a situation?
> (Everything works for smaller clusters.)
> 
> In my case it is OK for a node to disconnect (all the major resources
> are shut down)
> and later reconnect to the cluster (the running monitoring agent will
> clean up and restart major resources if needed),
> so I do not have STONITH configured.
> 
> Details:
> OS: CentOS 6
> Pacemaker: Pacemaker 1.1.9-1512.el6

Upgrade.

> Corosync: Corosync Cluster Engine, version '2.3.2'

Corosync 2 is not supported on EL6 at all. Please stick with corosync 1.4
and use the cman plugin as the quorum provider.

> Corosync configuration:
>         token: 10000
>         #token_retransmits_before_loss_const: 10
>         consensus: 15000
>         join: 1000
>         send_join: 80
>         merge: 1000
>         downcheck: 2000
>         #rrp_problem_count_timeout: 5000
>         max_network_delay: 150 # for azure
> 
> 
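
As a rough sanity check of the timing values quoted above, something like
the following Python sketch could be used. It assumes the rule from
corosync.conf(5) that consensus must be at least 1.2 * token; the
dictionary layout and the function name check_consensus are made up for
illustration and are not part of any corosync tooling.

```python
# Timing values copied from the totem settings quoted above (milliseconds).
totem = {
    "token": 10000,
    "consensus": 15000,
    "join": 1000,
    "merge": 1000,
    "downcheck": 2000,
}


def check_consensus(cfg, factor=1.2):
    """Return (ok, minimum) for the 'consensus >= factor * token' rule.

    The 1.2 factor is taken from corosync.conf(5); treat it as an
    assumption and check the man page for the version you actually run.
    """
    minimum = factor * cfg["token"]
    return cfg["consensus"] >= minimum, minimum


ok, minimum = check_consensus(totem)
print(f"consensus={totem['consensus']} ms, minimum={minimum:.0f} ms, ok={ok}")
```

With the values above the check passes, so the consensus/token ratio by
itself does not look like the problem.
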
> Some logs:
> 
> [...]
> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
> cib_process_diff:         Diff 1.9254.1 -> 1.9255.1 from local not
> applied to 1.9275.1: current "epoch" is greater than required
> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
> update_cib_cache_cb:      [cib_diff_notify] Patch aborted: Application
> of an update diff failed (-1006)
> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
> cib_process_diff:         Diff 1.9255.1 -> 1.9256.1 from local not
> applied to 1.9275.1: current "epoch" is greater than required
> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
> update_cib_cache_cb:      [cib_diff_notify] Patch aborted: Application
> of an update diff failed (-1006)
> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
> cib_process_diff:         Diff 1.9256.1 -> 1.9257.1 from local not
> applied to 1.9275.1: current "epoch" is greater than required
> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
> update_cib_cache_cb:      [cib_diff_notify] Patch aborted: Application
> of an update diff failed (-1006)
> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
> cib_process_diff:         Diff 1.9257.1 -> 1.9258.1 from local not
> applied to 1.9275.1: current "epoch" is greater than required
> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
> update_cib_cache_cb:      [cib_diff_notify] Patch aborted: Application
> of an update diff failed (-1006)
> [...]
> 
> [...]
> Nov 04 17:43:24 [12176] ip-10-109-145-175    crm_mon:    error:
> cib_native_perform_op_delegate:         Couldn't perform cib_query
> operation (timeout=120s): Operation already in progress (-114)
> Nov 04 17:43:24 [12176] ip-10-109-145-175    crm_mon:    error:
> get_cib_copy:   Couldnt retrieve the CIB
> Nov 04 17:43:24 [12176] ip-10-109-145-175    crm_mon:    error:
> cib_native_perform_op_delegate:         Couldn't perform cib_query
> operation (timeout=120s): Operation already in progress (-114)
> Nov 04 17:43:24 [12176] ip-10-109-145-175    crm_mon:    error:
> get_cib_copy:   Couldnt retrieve the CIB
> Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice  [QUORUM]
> Members[32]: 3 27 11 29 23 21 24 9 17 12 32 13 2 10 16 15 6 28 19 1 22 26 5\
> Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice  [QUORUM]
> Members[32]: 14 20 31 30 8 25 18 7 4
> Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice  [MAIN  ]
> Completed service synchronization, ready to provide service.
> Nov 04 18:06:55 [10599] ip-10-109-145-175 corosync notice  [QUORUM]
> Members[32]: 3 27 11 29 23 21 24 9 17 12 32 13 2 10 16 15 6 28 19 1 22 26 5\
> Nov 04 18:06:55 [10599] ip-10-109-145-175 corosync notice  [QUORUM]
> Members[32]: 14 20 31 30 8 25 18 7 4
> [...]
> 
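
The QUORUM lines above are wrapped, which makes it easy to miscount. A
small Python sketch like this one can confirm that all members are
listed. The two sample strings are the log lines above with the
wrapping undone; the assumption that node IDs run contiguously from 1 to
the member count is mine, and collect_members is a hypothetical helper.

```python
import re

# The QUORUM entries from the log above, one string per logged line.
lines = [
    "corosync notice  [QUORUM] Members[32]: 3 27 11 29 23 21 24 9 17 12 "
    "32 13 2 10 16 15 6 28 19 1 22 26 5",
    "corosync notice  [QUORUM] Members[32]: 14 20 31 30 8 25 18 7 4",
]

MEMBERS_RE = re.compile(r"\[QUORUM\] Members\[(\d+)\]: ([\d ]+)")


def collect_members(log_lines):
    """Return (expected_count, set_of_node_ids) from QUORUM log lines."""
    expected, seen = None, set()
    for line in log_lines:
        match = MEMBERS_RE.search(line)
        if match:
            expected = int(match.group(1))
            seen.update(int(node_id) for node_id in match.group(2).split())
    return expected, seen


expected, seen = collect_members(lines)
# Assumes node IDs are 1..expected; adjust if your nodelist differs.
missing = sorted(set(range(1, expected + 1)) - seen)
print(expected, len(seen), missing)  # prints: 32 32 []
```

So corosync membership is complete here; the trouble is above it, in
the CIB.
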
> [...]
> Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng:   notice:
> update_cib_cache_cb:    [cib_diff_notify] Patch aborted: Application of
> an update diff failed (-1006)
> Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng:     info:
> apply_xml_diff:         Digest mis-match: expected
> 01192e5118739b7c33c23f7645da3f45, calculated
> f8028c0c98526179ea5df0a2ba0d09de
> Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng:  warning:
> cib_process_diff:       Diff 1.15046.2 -> 1.15046.3 from local not
> applied to 1.15046.2: Failed application of an update diff
> Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng:   notice:
> update_cib_cache_cb:    [cib_diff_notify] Patch aborted: Application of
> an update diff failed (-1006)
> Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng:   notice:
> cib_process_diff:       Diff 1.15046.2 -> 1.15046.3 from local not
> applied to 1.15046.3: current "num_updates" is greater than required
> [...]
> 
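
To make the stonith-ng messages above easier to read: a CIB version is
admin_epoch.epoch.num_updates, and a diff is only applied when the local
version matches the version the diff was generated from. The Python
sketch below illustrates that comparison on log lines like the ones
quoted (with the wrapping undone). It only illustrates the version
check, it is not pacemaker's code, and the explain helper and its
message wording are hypothetical.

```python
import re

# Matches "Diff A.B.C -> D.E.F from <origin> not applied to G.H.I".
DIFF_RE = re.compile(
    r"Diff (\d+)\.(\d+)\.(\d+) -> (\d+)\.(\d+)\.(\d+) "
    r"from \S+ not applied to (\d+)\.(\d+)\.(\d+)"
)

FIELDS = ("admin_epoch", "epoch", "num_updates")


def explain(line):
    """Describe why a diff did not apply, based on the version numbers only."""
    match = DIFF_RE.search(line)
    if match is None:
        return None
    nums = tuple(int(group) for group in match.groups())
    # nums[3:6] is the version the diff would produce; it is not needed here.
    source, current = nums[0:3], nums[6:9]
    for name, have, need in zip(FIELDS, current, source):
        if have != need:
            relation = "ahead of" if have > need else "behind"
            return f"local {name} {have} is {relation} the diff's source {need}"
    return "versions match; the failure has another cause (e.g. digest mismatch)"


samples = [
    "Diff 1.9254.1 -> 1.9255.1 from local not applied to 1.9275.1",
    "Diff 1.15046.2 -> 1.15046.3 from local not applied to 1.15046.2",
    "Diff 1.15046.2 -> 1.15046.3 from local not applied to 1.15046.3",
]
for sample in samples:
    print(explain(sample))
```

The first sample corresponds to the "epoch is greater than required"
message, the second to the digest mismatch, and the third to the
"num_updates is greater than required" message.
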
> 
> P.S. Sorry if this should have been posted on the corosync newsgroup;
> only the CIB synchronization fails, so this group seemed to me the
> right place.

All of the HA mailing lists are merging into ClusterLabs. This is the
right place to ask.

> -- 
> Best Regards,
> 
> Radoslaw Garbacz

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?