[ClusterLabs] Antw: [EXT] Re: Maximum cluster size with Pacemaker 2.x and Corosync 3.x, and scaling to hundreds of nodes

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Fri Jul 31 01:57:29 EDT 2020

>>> Ken Gaillot <kgaillot at redhat.com> schrieb am 30.07.2020 um 16:43 in
<93b973947008b62c4848f8a799ddc3f0949451e8.camel at redhat.com>:
> On Wed, 2020-07-29 at 23:12 +0000, Toby Haynes wrote:
>> In Corosync 1.x there was a limit on the maximum number of active
>> nodes in a corosync cluster - browsing the mailing list says 64
>> hosts. The Pacemaker 1.1 documentation says scalability goes up to 16
>> nodes. The Pacemaker 2.0 documentation says the same, although I
>> can't find a maximum number of nodes in Corosync 3.
> My understanding is that there is no theoretical limit, only practical
> limits, so giving a single number is somewhat arbitrary.
> There is a huge difference between full cluster nodes (running corosync
> and all pacemaker daemons) and Pacemaker Remote nodes (running only
> pacemaker-remoted).
> Corosync uses a ring model where a token has to be passed in a very
> short amount of time, and also has message guarantees (i.e. every node
> has to confirm receiving a message before it is made available), so
> there is a low practical limit to full cluster nodes. The 16 or 32
> number comes from what enterprise providers are willing to support, and
> is a good ballpark for a real-world comfort zone. Even at 32 you need a

What I'd like to see is a table of recommended parameters, depending on
the number of nodes and the maximum acceptable network delay.

The other thing I'd like to see is a world-wide histogram (x-axis: number of
nodes, y-axis: number of installations) of pacemaker clusters.
Here we have a configuration of two 2-node clusters and one 3-node cluster.
Initially we had planned to build one 7-node cluster, but stability
(common fencing) and configuration issues (growing complexity) prevented that.

> dedicated fast network and likely some tuning tweaks. Going beyond that
> is possible but depends on hardware and tuning, and becomes sensitive
> to slight disturbances.
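As an illustration of the tuning mentioned above (these values are my own
assumptions, not a recommendation): the token timing lives in the totem
section of corosync.conf, per the corosync.conf(5) man page.

```
totem {
    version: 2
    # Base token timeout in ms. Corosync 3 additionally applies
    # token_coefficient (default 650 ms) per node beyond two, so the
    # effective timeout already grows with ring size.
    token: 3000
    # How many token retransmissions are attempted before the node is
    # considered failed and a new ring configuration is formed.
    token_retransmits_before_loss_const: 10
}
```

Larger token values tolerate slower networks at the cost of slower
failure detection, which is exactly the node-count/network-delay
trade-off such a table would have to capture.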
> Pacemaker Remote nodes on the other hand are lightweight. They
> communicate with only a single cluster node, with relatively low
> traffic. The upper bound is unknown; some people report getting strange
> errors with as few as 40 remote nodes, while others run over 100 with
> no problems. So it may well depend on network and hardware capabilities

See the parameter table requested above.

> at high numbers, and you can run far more in VMs or containers than on
> bare metal, since traffic will (usually) be internal rather than over
> the network.
> I would expect a cluster with 16-32 full nodes and several hundred
> remotes (maybe even thousands in VMs or containers) to be feasible with
> the right hardware and tuning.

I wonder: Do such configurations have a lot of identical or similar resources,
or do they do massive load balancing, or do they run many different resources?

> Since remotes don't run all the daemons, they can't do things like
> directly execute fence devices or contribute to cluster quorum, but
> remotes on bare metal or VMs are not really in a hierarchy as far as
> the services being clustered go. A resource can move between cluster
> and remote nodes, and a remote's connection can move from one cluster
> node to another without interrupting the services on the remote.
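For reference, a remote node is integrated via an ocf:pacemaker:remote
connection resource; a minimal sketch with pcs (the hostname is a
placeholder, and pacemaker-remoted with a shared authkey must already
be running on the remote host):

```
# Create the connection resource; the cluster then schedules resources
# onto the remote node like any other node.
pcs resource create remote1 ocf:pacemaker:remote server=remote1.example.com
```

The connection resource itself runs on a full cluster node, which is
why it can move between cluster nodes without stopping the services
running on the remote.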

