[ClusterLabs] Re: [EXT] Re: Q: utilization, stickiness and resource placement

Ken Gaillot kgaillot at redhat.com
Fri Jan 22 11:01:50 EST 2021


On Fri, 2021-01-22 at 08:38 +0100, Ulrich Windl wrote:
> > > > Ken Gaillot <kgaillot at redhat.com> wrote on 21.01.2021 at
> > > > 17:24 in message
> <28f8b077a30233efa41d04688eb21e82c8432ddd.camel at redhat.com>:
> > On Thu, 2021-01-21 at 08:19 +0100, Ulrich Windl wrote:
> > > Hi!
> > > 
> > > I have a question about utilization-based resource placement
> > > (specifically: placement-strategy=balanced):
> > > Assume you have two resource capacities (say A and B) on each
> > > node,
> > > and each resource also has a utilization parameter for both.
> > > Both nodes have enough capacity for a resource to be started.
> > > Consider these cases for resource R:
> > > 1) R needs A = B
> > > 2) R needs A > B
> > > 3) R needs A < B
> > > 
> > > Maybe consider these cases for each node:
> > > a) A = B
> > > b) A > B
> > > c) A < B
> > > 
> > > Where would the resources be placed?
> > 
> > For computational efficiency, Pacemaker follows a very simple
> > algorithm, described here:
> > 
> > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Explained/index.html#_allocation_details
> > 
> > Basically, nodes and resources are sorted according to a weighting,
> > nodes are assigned resources starting with the highest-weighted node
> > first, and individual resources are placed starting with the
> > highest-weighted resource first. That link describes the weighting.
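
To make that concrete for your A/B question: utilization placement
needs capacities defined on each node and requirements on each
resource, both via <utilization> blocks in the CIB. A minimal sketch
(names and values made up):

    <node id="1" uname="h16">
      <utilization id="h16-utilization">
        <nvpair id="h16-utilization-A" name="A" value="100"/>
        <nvpair id="h16-utilization-B" name="B" value="100"/>
      </utilization>
    </node>

    <primitive id="R" class="ocf" provider="pacemaker" type="Dummy">
      <utilization id="R-utilization">
        <nvpair id="R-utilization-A" name="A" value="10"/>
        <nvpair id="R-utilization-B" name="B" value="20"/>
      </utilization>
    </primitive>

How the free capacities of independent attributes like A and B are
compared is part of the weighting described at that link.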
> 
> Hi!
> 
> That's interesting: I thought Pacemaker picks a resource to run
> first, and then a node to run the resource, but it seems it's the
> other way round: first pick a node, then a resource.
> However, when looking at the output of "crm_simulate -LUs", I see
> node scores per resource, that is, many of them instead of one.

Definitely -- each resource has a score on each node, and each
resource's preferred node is the node with the highest score for it.
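
For example, on a live cluster:

    # show every resource's allocation score on every node,
    # as computed for the current cluster state
    crm_simulate -Ls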

> Also there is a phrase I don't understand: "The resource that has the
> highest score on the node where it's running gets allocated first..."
> Why does a resource that is already running have to be allocated?

Where it is now is not necessarily where it should be next.

It could be stopping or migrating, or newly added resources might shift
the balance (with or without utilization), or a resource it depends on
might be moving, or there might be constraint changes, time-based
rules, etc. etc.

> Also it seems the output of crm_simulate does not present the
> absolute numbers, but rather steps of a computation. For example,
> let's look at the DLM clone here:
> pcmk__clone_allocate: cln_DLM allocation score on h16: 4000
> pcmk__clone_allocate: cln_DLM allocation score on h18: 4000
> pcmk__clone_allocate: cln_DLM allocation score on h19: 8000
> 
> # OK, for some reason h19 is preferred significantly (by 4000)
> 
> pcmk__clone_allocate: prm_DLM:0 allocation score on h16: 1
> pcmk__clone_allocate: prm_DLM:0 allocation score on h18: 0
> pcmk__clone_allocate: prm_DLM:0 allocation score on h19: 0
> 
> # The first instance prefers h16 however. Why not h19, BTW?
> 
> pcmk__clone_allocate: prm_DLM:1 allocation score on h16: 0
> pcmk__clone_allocate: prm_DLM:1 allocation score on h18: 0
> pcmk__clone_allocate: prm_DLM:1 allocation score on h19: 1
> 
> # the second instance prefers h19
> 
> pcmk__clone_allocate: prm_DLM:2 allocation score on h16: 0
> pcmk__clone_allocate: prm_DLM:2 allocation score on h18: 1
> pcmk__clone_allocate: prm_DLM:2 allocation score on h19: 0
> 
> # so the third instance goes to h18
> 
> pcmk__native_allocate: prm_DLM:1 allocation score on h16: 0
> pcmk__native_allocate: prm_DLM:1 allocation score on h18: 0
> pcmk__native_allocate: prm_DLM:1 allocation score on h19: 1
> native_assign_node: prm_DLM:1 utilization on h19:
> 
> # so the second instance goes to h19 (see above)
> 
> pcmk__native_allocate: prm_DLM:0 allocation score on h16: 1
> pcmk__native_allocate: prm_DLM:0 allocation score on h18: 0
> pcmk__native_allocate: prm_DLM:0 allocation score on h19: -INFINITY
> native_assign_node: prm_DLM:0 utilization on h16:
> 
> # the first instance goes to h16, and h19 gets -INF as there is
> # already an instance
> 
> pcmk__native_allocate: prm_DLM:2 allocation score on h16: -INFINITY
> pcmk__native_allocate: prm_DLM:2 allocation score on h18: 1
> pcmk__native_allocate: prm_DLM:2 allocation score on h19: -INFINITY
> native_assign_node: prm_DLM:2 utilization on h18:
> 
> # the third instance goes to h18 as the other two have -INF
> 
> # What I wanted to say: why don't the other nodes have a score of
> # -INF right from the beginning?

Because that's what code is for :)

Everything starts at 0, and the code proceeds through a very
complicated and obscure series of steps to consider a zillion factors
one by one and update the scores. It's very daunting and impossible for
the human mind to comprehend all at once (at least for anyone I've
met...).

Hopefully over time we can get it to be clearer about what it's doing,
but it's just a lot of information to try to condense.
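
One way to watch those steps is to grep the score output by phase; for
example, for the DLM clone instances above (resource names from your
output):

    # clone-level scores are computed first ...
    crm_simulate -Ls | grep 'pcmk__clone_allocate.*prm_DLM'

    # ... then instances are assigned to nodes one by one, which is
    # where the -INFINITY scores appear once a node is taken
    crm_simulate -Ls | grep 'pcmk__native_allocate.*prm_DLM'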

> (Apart from that, the whole description of crm_simulate is very
> short, namely: "crm_simulate - simulate a Pacemaker cluster's
> response to events")
> 
> I understand that the cluster cannot find the perfect allocation
> easily, but if it could assign some quality factor (e.g. 0-100) that
> says how well a placement fulfills the requirements, it could in
> principle evaluate other allocations, leading to different quality
> factors. Then I could imagine that the bigger the difference in the
> quality factor, the less stickiness you need to trigger a different
> allocation.
> 
> Apart from that (I think this is called the Monte Carlo method), the
> cluster could evaluate some random allocations, remembering the one
> with the highest quality factor. That wouldn't be optimal, but
> depending on the number of iterations it could be "rather good". As
> documented this may be computationally expensive, but some external
> tool could assist in determining that while the cluster is doing its
> job.

Definitely there are some interesting alternative approaches; it would
be great to have somebody expert in those fields dedicated to exploring
them for a while.

> > > (Obviously A and B are independent, so you can't say "how many A
> > > are worth one B" or vice versa.)
> > > 
> > > Then, given some placement, assume that node capacity for A or B
> > > increases on a node.
> > > How large does the stickiness parameter for the resource have to
> > > be to prevent resource migration?
> > 
> > That's tricky to deduce. The easiest approach is to copy a
> > scheduler input from the cluster as it is, and experiment by
> > modifying the stickiness and utilization values as desired and
> > running crm_simulate.
> > 
> > Everything boils down to each resource having a score on each node.
> > That final score is the cumulative result of all sorts of effects,
> > e.g. stickiness, the resource's own location preferences (constraint
> > scores), a fraction of the location preferences of resources that
> > are directly or indirectly colocated with this resource, etc., so
> > it's not easy to say "changing this one value by X will have this
> > effect".
> > 
> > > Are there any tools to help understanding?
> > 
> > Mainly crm_simulate
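
To make that experiment concrete (the pe-input file name below is just
an example; look in /var/lib/pacemaker/pengine/ for real ones):

    # work on a copy of a saved scheduler input
    bzcat /var/lib/pacemaker/pengine/pe-input-123.bz2 > /tmp/test.xml

    # baseline: show scores, utilization, and the resulting transition
    crm_simulate -x /tmp/test.xml -S -sU

    # edit resource-stickiness and/or utilization values in
    # /tmp/test.xml, then re-run and compare the placements
    crm_simulate -x /tmp/test.xml -S -sU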
> 
> Another note: Is it correct that the example given assumes that
> memcached is a clone? If not, what is the zero in "memcached:0"?
> Maybe the manual

Yes, anything with :N is a clone or bundle instance
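
For reference, a minimal sketch of such a clone in the CIB (IDs made
up):

    <clone id="memcached-clone">
      <primitive id="memcached" class="ocf" provider="heartbeat"
                 type="memcached">
        <operations>
          <op id="memcached-monitor" name="monitor" interval="20s"/>
        </operations>
      </primitive>
    </clone>

Its instances then appear as memcached:0, memcached:1, and so on.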

> page could be extended a bit to explain. Fixing the alignment would
> be a good idea, too:
> 
> EXAMPLES
>        Pretend a recurring monitor
> action  found  memcached  stopped  on 
> node
>        fred.example.com  and,  during recovery, that the memcached
> stop
> action
>        failed:
> 
>               crm_simulate       -LS       --op-inject      
> memcached:0_moni-
>               tor_20000 at bart.example.com=7            --op-
> fail          
> mem-
>               cached:0_stop_0 at fred.example.com=1    --save-output   
> /tmp/mem-
>               cached-test.xml
> 
> I'm not a roff-geek, but in my pages I use a combination of ".na"
> and ".ad b" around the example input.
> 
> Regards,
> Ulrich
> 
> > 
> > > Note: For placement-strategy=utilization it's easier: as long as
> > > there is sufficient capacity, place the resources on the node
> > > that has the fewest resources.
> > > 
> > > Regards,
> > > Ulrich

-- 
Ken Gaillot <kgaillot at redhat.com>


