[ClusterLabs] Pacemaker 1.1.17 Release Candidate 4 (likely final)

Ken Gaillot kgaillot at redhat.com
Wed Jun 21 14:10:04 UTC 2017


On 06/21/2017 02:58 AM, Ferenc Wágner wrote:
> Ken Gaillot <kgaillot at redhat.com> writes:
> 
>> The most significant change in this release is a new cluster option to
>> improve scalability.
>>
>> As users start to create clusters with hundreds of resources and many
>> nodes, one bottleneck is a complete reprobe of all resources (for
>> example, after a cleanup of all resources).
> 
> Hi,
> 
> Does crm_resource --cleanup without any --resource specified do this?
> Does this happen any other (automatic or manual) way?

Correct.
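
For example (the resource name here is just a placeholder):

    # one resource: only that resource's history is erased and reprobed
    crm_resource --cleanup --resource my-rsc

    # no --resource: every resource is reprobed on every node
    crm_resource --cleanup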

A full probe also happens at startup, but that generally is spread out
over enough time not to matter.

Prior to this release, a full write-out of all node attributes also
occurred whenever a node joined the cluster, which had similar
characteristics (due to the fail counts kept for each resource on each
node). With this release, that write-out is skipped when using the
corosync 2 stack, since we have extra guarantees there that make it
unnecessary.

> 
>> This can generate enough CIB updates to get the crmd's CIB connection
>> dropped for not processing them quickly enough.
> 
> Is this a catastrophic scenario, or does the cluster recover gently?

The crmd exits, leading to node fencing.

>> This bottleneck has been addressed with a new cluster option,
>> cluster-ipc-limit, to raise the threshold for dropping the connection.
>> The default is 500. The recommended value is the number of nodes in the
>> cluster multiplied by the number of resources.
> 
> I'm running a production cluster with 6 nodes and 159 resources (ATM),
> which gives almost twice the above default.  What symptoms should I
> expect to see under 1.1.16?  (1.1.16 has just been released with Debian
> stretch.  We can't really upgrade it, but changing the built-in default
> is possible if it makes sense.)

Even twice the threshold is fine in most clusters, because it's highly
unlikely that all probe results will come back at exactly the same time.
The case that prompted this change involved 200 resources (on 9 nodes)
whose monitor action was a simple pid check, so the probes completed
nearly instantaneously.
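
Once you're on 1.1.17, cluster-ipc-limit can be set like any other
cluster property, e.g. with crm_attribute. As a rough sketch for the
6-node, 159-resource cluster above (6 * 159 = 954; the rounding up for
headroom is just my suggestion):

    crm_attribute --type crm_config --name cluster-ipc-limit --update 1000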

The symptom is an "Evicting client" log message from the cib, listing
the pid of the crmd, followed by the crmd exiting.

Changing the compiled-in default on older versions is a potential
workaround (search for 500 in lib/common/ipc.c), but it's not ideal: it
applies to all clusters (even those too small to need it) and to all
clients (including command-line clients), whereas the new
cluster-ipc-limit option only affects connections from other cluster
daemons.

The only real downside of increasing the threshold is the potential for
increased memory usage (which is why there's a limit to begin with: to
keep an unresponsive client from causing a memory surge in a cluster
daemon). The usage depends on the size of the queued IPC messages -- for
probe results, it should be under 1K per result. The memory is only used
if the queue actually backs up (it's not pre-allocated).
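
To illustrate the general idea, here's a simplified sketch (not the
actual Pacemaker IPC code) of a per-client event queue: memory is only
allocated when the client falls behind, and the client is evicted once
its backlog crosses the limit.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdbool.h>

    /* Simplified model of a per-client IPC event queue.  Messages are
     * only queued (and memory only allocated) when the client is too
     * slow to read them; once the backlog exceeds the limit, the client
     * is evicted.
     */
    typedef struct ipc_msg {
        struct ipc_msg *next;
        size_t size;
        char data[];            /* flexible array member holding the payload */
    } ipc_msg_t;

    typedef struct ipc_client {
        const char *name;
        ipc_msg_t *head, *tail; /* queued (unread) messages */
        unsigned int backlog;   /* number of queued messages */
        unsigned int limit;     /* e.g. 500, or whatever cluster-ipc-limit is */
        bool evicted;
    } ipc_client_t;

    /* Queue one message for a slow client; evict it if the backlog is too big. */
    static bool
    queue_event(ipc_client_t *c, const char *payload)
    {
        size_t len = strlen(payload) + 1;
        ipc_msg_t *msg;

        if (c->evicted) {
            return false;
        }
        if (c->backlog >= c->limit) {
            fprintf(stderr, "Evicting client %s: %u unread messages\n",
                    c->name, c->backlog);
            c->evicted = true;   /* a real daemon would drop the connection */
            return false;
        }
        msg = malloc(sizeof(*msg) + len);   /* memory used only when queued */
        if (msg == NULL) {
            return false;
        }
        msg->next = NULL;
        msg->size = len;
        memcpy(msg->data, payload, len);
        if (c->tail) {
            c->tail->next = msg;
        } else {
            c->head = msg;
        }
        c->tail = msg;
        c->backlog++;
        return true;
    }

    int
    main(void)
    {
        ipc_client_t crmd = { .name = "crmd", .limit = 500 };

        /* Simulate a flood of probe results arriving faster than the
         * client reads them: the 501st message triggers eviction.
         */
        for (int i = 0; i < 600; i++) {
            queue_event(&crmd, "<probe-result/>");
        }
        printf("backlog=%u evicted=%s\n", crmd.backlog,
               crmd.evicted ? "yes" : "no");
        return 0;
    }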



