[ClusterLabs Developers] Pacemaker: possibilities of schedulerd optimization

Wed Jul 15 16:07:18 EDT 2020

On Mon, 2020-07-06 at 14:27 +0000, Denis Koptev wrote:
> Hello!
>  
> My team uses Pacemaker as a cluster manager. Our configuration is
> quite large, with many resources and constraints.
> While using Pacemaker, we noticed that its decision-making algorithm
> is slow for large configurations. We gathered some flame graphs to
> determine where the problem could be.
> According to the schedulerd flame graph attached to this email, the
> decision-making part takes a lot of time.
> In particular, stage7, which is responsible for applying ordering
> constraints and updating actions for all resources, is quite
> complicated.
>  
> We investigated the Pacemaker code and have some ideas about how we
> could potentially speed up schedulerd.
> I'd very much appreciate it if you could share your thoughts and
> knowledge on this topic.
>  
> The main concern is that schedulerd currently re-calculates all
> meta-data each time a notification from controld is received.
> This requires querying the entire CIB and analyzing it in schedulerd
> code, applying all constraints from scratch. However, this can be
> inefficient when the CIB change is small.
> BTW, controld receives CIB diff, but still provides schedulerd with
> entire CIB.
> So, what if we could pass CIB diff to schedulerd instead of the
> entire CIB and modify schedulerd logic in order to store all previous
> meta-data and apply diff to it somehow?

The "somehow" would probably be as difficult as starting from scratch.
Changes in one part of the CIB can affect data parsed from other
sections -- for example, changing a cluster option can affect how
resource and operation meta-attributes are determined, and changing a
fail-count node attribute affects the resource object for the resource
that failed. Trying to enumerate and handle all those relationships
would be challenging.
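
As a toy illustration of that kind of coupling (plain Python, not
Pacemaker code; the "default-stickiness" and "stickiness" names are
made up for the example), a resource's effective setting below is
derived from its own section plus a cluster-wide default, so a change
confined to the cluster options still alters what is derived for
untouched resources:

```python
# Toy model only: the option and attribute names are hypothetical,
# not real Pacemaker names, and this is not how schedulerd parses a CIB.

def effective_stickiness(cib, rsc_id):
    """Derive a resource's setting from its own section, falling back
    to a cluster-wide default held in a different section."""
    rsc = cib["resources"][rsc_id]
    if "stickiness" in rsc:
        return rsc["stickiness"]
    return cib["cluster_options"]["default-stickiness"]


def affected_by(cib_before, cib_after):
    """Return the resources whose derived value differs between two CIBs."""
    return sorted(
        rsc_id for rsc_id in cib_after["resources"]
        if effective_stickiness(cib_before, rsc_id)
        != effective_stickiness(cib_after, rsc_id)
    )


before = {
    "cluster_options": {"default-stickiness": 0},
    "resources": {"db": {}, "web": {"stickiness": 100}, "vip": {}},
}
# The change touches only the cluster options section...
after = {
    "cluster_options": {"default-stickiness": 50},
    "resources": before["resources"],
}
# ...yet the derived data changes for resources that were not edited.
changed = affected_by(before, after)
print(changed)
```

An incremental scheduler would need a dependency map like this for
every such relationship, which is the enumeration problem described
above.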

Additionally, parts of the configuration can be controlled by rules,
which can be based on the current time of day or on the current value
of a node attribute, so those have to be re-evaluated every time.
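
A minimal sketch of why cached results go stale even when the CIB does
not change at all (toy Python with a hypothetical rule structure, not
Pacemaker's rule syntax or evaluation code): the same configuration
yields different values depending only on when it is evaluated.

```python
from datetime import datetime, time

# Toy model only: the rule layout and field names are hypothetical.

def in_window(now, start=time(9, 0), end=time(17, 0)):
    """Toy date rule: true during a daily time window."""
    return start <= now.time() < end


def evaluate(cib, now):
    """Pick the value whose rule matches at the given time."""
    rule = cib["rule"]
    return rule["value_if_true"] if in_window(now) else rule["value_if_false"]


cib = {"rule": {"value_if_true": "INFINITY", "value_if_false": "0"}}

# Identical CIB, two evaluation times, two different results.
morning = evaluate(cib, datetime(2020, 7, 15, 10, 0))
night = evaluate(cib, datetime(2020, 7, 15, 22, 0))
print(morning, night)
```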

Finally, the scheduler is designed so that each run is completely
independent. The benefits are that we can run regression tests via
crm_simulate on saved CIBs and know it's identical to what the
scheduler would do; having the snapshot of the CIB at the time a bug
occurred in a production cluster is sufficient for us to debug any
scheduler issue; and the scheduler can crash and respawn at any time
and resume proper functioning. This could potentially be worked around
when implementing your idea, but it would take a lot of extra effort.
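
The independence property can be sketched as a pure function (toy
Python under made-up assumptions; the placement policy and data layout
are hypothetical and far simpler than the real scheduler): because the
result depends only on the input snapshot, replaying a saved snapshot
reproduces exactly what the live run decided.

```python
import copy

# Toy stand-in for a scheduler run, not Pacemaker's algorithm.

def schedule(cib):
    """Pure function of its input: place each resource on the online
    node with the fewest resources so far (ties broken by node name)."""
    nodes = sorted(n for n, state in cib["nodes"].items() if state == "online")
    load = {n: 0 for n in nodes}
    actions = []
    for rsc in sorted(cib["resources"]):
        target = min(nodes, key=lambda n: (load[n], n))
        load[target] += 1
        actions.append(("start", rsc, target))
    return actions


saved_cib = {
    "nodes": {"node1": "online", "node2": "online", "node3": "offline"},
    "resources": ["db", "vip", "web"],
}

live_result = schedule(saved_cib)
# Replaying a copy of the snapshot later, with no state carried over
# from the first run, gives the same actions.
replayed = schedule(copy.deepcopy(saved_cib))
print(replayed == live_result)
```

Once a run depends on state kept from previous runs, the snapshot
alone is no longer enough to reproduce a decision.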

> Knowing what changed (or what type/part of the configuration changed)
> could potentially allow us to skip processing some (large or small)
> portion of the resources and actions.
>  
> Could you please tell me what you think about it? Perhaps you have
> already faced this issue and have some ideas or even patches for
> schedulerd to speed it up.

There is definitely room for improving the efficiency of the scheduler
-- we've made a couple of big improvements in particular scenarios in
the last few versions.

Recently (2.0.3) we introduced the ability to time the scheduler
simulations. It doesn't go down to the function level like your flame
graph, but it's another way of gathering performance data. It's still a
bit rough around the edges but it works like this:

For a single pacemaker build, you can run

  crm_simulate --profile <dir> --repeat 1000

where <dir> is cts/scheduler if you're in a source checkout or
/usr/share/pacemaker/tests/scheduler if you have an install.

That command runs 1000 simulations of each regression test case,
showing the CPU time needed for each case. There are over 700 test
cases, so it takes a long time.

Of course you could give a different directory containing CIB XML files
of your own choosing.

The times vary tremendously for different test cases, so that gives
some insight as to what parts of the code may need more attention.

You can run the same command for two different pacemaker builds, save
the output to files, then run

  tools/pcmk_simtimes <output-files> -s 0.1 -p 5

from a source checkout to show how the two builds compare (the idea
being we can see if one version is more or less efficient than
another). We haven't yet determined which numbers give reliably
comparable times. It would probably be ideal to run it on a bare-metal
machine that's not doing anything else.
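
A rough sketch of the kind of per-test comparison involved (toy
Python, not the pcmk_simtimes implementation; the test names, times,
and the min_seconds/min_percent parameters are hypothetical, and
treating them as a seconds threshold and a percent threshold that must
both be met is an assumption made for this example):

```python
# Toy comparison of per-test CPU times from two builds.
# All names and numbers here are made up for illustration.

def compare_times(before, after, min_seconds=0.0, min_percent=0.0):
    """Return (test, old, new, diff, percent) rows for tests present in
    both runs whose change meets both thresholds."""
    rows = []
    for test in sorted(before.keys() & after.keys()):
        old, new = before[test], after[test]
        diff = new - old
        if old:
            percent = diff / old * 100.0
        else:
            percent = 0.0 if diff == 0 else float("inf")
        if abs(diff) >= min_seconds and abs(percent) >= min_percent:
            rows.append((test, old, new, diff, percent))
    return rows


before = {"test-a": 2.00, "test-b": 0.50, "test-c": 1.00}
after = {"test-a": 1.50, "test-b": 0.52, "test-c": 1.20}

rows = compare_times(before, after, min_seconds=0.1, min_percent=5)
for test, old, new, diff, percent in rows:
    print(f"{test}: {old:.2f}s -> {new:.2f}s ({diff:+.2f}s, {percent:+.1f}%)")
```

Filtering out small differences this way keeps timing noise from
hiding the test cases where one build is meaningfully faster or slower.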

None of that is particularly what you are interested in :) but the
bottom line is that speeding up full scheduler runs is more promising
than trying to preserve information between runs. Also, I'd be curious
how Pacemaker 2.0.4 compares to 2.0.1 using your CIB -- there were some
improvements in that time.

> The version of Pacemaker we use is 2.0.1+20190417.13d370ca9 (release
> 3.6.1).
>  
> Thanks a lot in advance!
-- 
Ken Gaillot <kgaillot at redhat.com>