[ClusterLabs] Maximum cluster size with Pacemaker 2.x and Corosync 3.x, and scaling to hundreds of nodes

Thu Jul 30 10:43:21 EDT 2020

On Wed, 2020-07-29 at 23:12 +0000, Toby Haynes wrote:
> In Corosync 1.x there was a limit on the maximum number of active
> nodes in a corosync cluster - browsing the mailing list says 64
> hosts. The Pacemaker 1.1 documentation says scalability goes up to 16
> nodes. The Pacemaker 2.0 documentation says the same, although I
> can't find a maximum number of nodes in Corosync 3.

My understanding is that there is no theoretical limit, only practical
limits, so giving a single number is somewhat arbitrary.

There is a huge difference between full cluster nodes (running corosync
and all pacemaker daemons) and Pacemaker Remote nodes (running only
pacemaker-remoted).

Corosync uses a ring model where a token has to be passed in a very
short amount of time, and also has message guarantees (i.e. every node
has to confirm receiving a message before it is made available), so
there is a low practical limit on the number of full cluster nodes. The
commonly quoted figures of 16 or 32 come from what enterprise providers
are willing to support, and are a good ballpark for a real-world comfort
zone. Even at 32 you need a
dedicated fast network and likely some tuning tweaks. Going beyond that
is possible but depends on hardware and tuning, and becomes sensitive
to slight disturbances.
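
As a rough illustration of how node count feeds into the token timing:
corosync.conf(5) documents that, when a nodelist with at least 3 nodes
is configured, the effective token timeout is
token + (number_of_nodes - 2) * token_coefficient, with
token_coefficient defaulting to 650 ms. Below is a minimal Python sketch
of that arithmetic only. The function name is made up for illustration,
and the 1000 ms token value is an assumed example, not a statement of
any particular version's default.

```python
def effective_token_timeout_ms(nodes, token_ms=1000, token_coefficient_ms=650):
    """Effective totem token timeout in milliseconds.

    Follows the formula in corosync.conf(5):
        token + (number_of_nodes - 2) * token_coefficient
    which applies only when a nodelist with at least 3 nodes is configured.
    token_ms=1000 is an assumed example value; check your own configuration.
    """
    if nodes < 3:
        # The coefficient is not applied below 3 nodes.
        return token_ms
    return token_ms + (nodes - 2) * token_coefficient_ms


# Print the effective timeout for a few cluster sizes.
for n in (2, 3, 16, 32, 64):
    print(n, effective_token_timeout_ms(n))
```

With these example values the timeout is 10100 ms at 16 nodes and 20500
ms at 32 nodes, so the time before a lost token is declared grows
steadily with the number of full cluster nodes unless you tune it.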

Pacemaker Remote nodes on the other hand are lightweight. They
communicate with only a single cluster node, with relatively low
traffic. The upper bound is unknown; some people report getting strange
errors with as few as 40 remote nodes, while others run over 100 with
no problems. So it may well depend on network and hardware capabilities
at high numbers, and you can run far more in VMs or containers than on
bare metal, since traffic will (usually) be internal rather than over
the network.

I would expect a cluster with 16-32 full nodes and several hundred
remotes (maybe even thousands in VMs or containers) to be feasible with
the right hardware and tuning.
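
For a back-of-the-envelope feel for those numbers, the Python sketch
below estimates how many remote connections each full cluster node would
host. It assumes the connections end up spread evenly across the
available full nodes, which real placement may not do, and the function
and parameter names are hypothetical.

```python
import math


def remote_connections_per_node(full_nodes, remote_nodes, failed_full_nodes=0):
    """Estimate the busiest full node's share of remote connections.

    Assumes an even spread across surviving full nodes (an assumption for
    illustration; actual placement depends on the cluster's configuration).
    """
    surviving = full_nodes - failed_full_nodes
    if surviving < 1:
        raise ValueError("at least one full cluster node must remain")
    # Round up: the busiest node carries the remainder.
    return math.ceil(remote_nodes / surviving)


print(remote_connections_per_node(16, 300))      # all full nodes up
print(remote_connections_per_node(16, 300, 1))   # one full node lost
print(remote_connections_per_node(32, 1000))
```

Under that even-spread assumption, 300 remotes across 16 full nodes
works out to about 19 connections per node, and 1000 remotes across 32
full nodes to about 32 per node.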

Since remotes don't run all the daemons, they can't do things like
directly execute fence devices or contribute to cluster quorum, but
remotes on bare metal or VMs are not really in a hierarchy as far as
the services being clustered go. A resource can move between cluster
and remote nodes, and a remote's connection can move from one cluster
node to another without interrupting the services on the remote.

> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Remote/
> discusses deployments up to 64 hosts but it appears to reference
> Pacemaker 1.16.
>  
> With the arrival of Corosync 3.x (and Pacemaker 2.x) how large a
> cluster can be supported? If we want to get to a cluster with 100+
> nodes, what are the best design approaches, especially if there is no
> clear hierarchy to the nodes in use (i.e. all of the hosts are
> important!).
>  
> Are there performance implications when comparing the operation of a
> pacemaker remote node to a full stack pacemaker node?
>  
> Thanks,
> 
> Toby Haynes
-- 
Ken Gaillot <kgaillot at redhat.com>