[ClusterLabs] Antw: [EXT] Re: Q: utilization, stickiness and resource placement

Fri Jan 22 02:38:25 EST 2021

>>> Ken Gaillot <kgaillot at redhat.com> wrote on 21.01.2021 at 17:24 in
message
<28f8b077a30233efa41d04688eb21e82c8432ddd.camel at redhat.com>:
> On Thu, 2021-01-21 at 08:19 +0100, Ulrich Windl wrote:
>> Hi!
>> 
>> I have a question about utilization-based resource placement
>> (specifically: placement-strategy=balanced):
>> Assume you have two resource capacities (say A and B) on each node,
>> and each resource also has a utilization parameter for both.
>> Both nodes have enough capacity for a resource to be started.
>> Consider these cases for resource R:
>> 1) R needs A = B
>> 2) R needs A > B
>> 3) R needs A < B
>> 
>> Maybe consider these cases for each node:
>> a) A = B
>> b) A > B
>> c) A < B
>> 
>> Where would the resources be placed?
> 
> For computational efficiency, Pacemaker follows a very simple
> algorithm, described here:
> 
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Explained/index.html#_allocation_details
> 
> Basically, nodes and resources are sorted according to a weighting,
> nodes are assigned resources starting with the highest-weighted node
> first, and individual resources are placed starting with the
> highest-weighted resource first. That link describes the weighting.
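
As a toy reading of that description (not Pacemaker's scheduler code), the
ordering could be sketched in Python as below. The dict layout, the place()
helper, re-sorting the nodes after every assignment, and the use of summed
free capacity as the node weight are all assumptions made for the sketch; the
real weighting is the one described at the link above.

```python
def free_capacity(node):
    # Hypothetical node weight: total remaining capacity over all types.
    # Summing unrelated capacity types is a crude placeholder, not a rule
    # taken from Pacemaker.
    return sum(node["free"].values())


def fits(node, res):
    # A resource fits if every capacity type it needs is still available.
    return all(node["free"].get(kind, 0) >= need
               for kind, need in res["needs"].items())


def place(nodes, resources):
    """Repeatedly take the highest-weighted node and give it the
    highest-weighted pending resource that fits on it."""
    placement = {}
    pending = sorted(resources, key=lambda r: -r["weight"])
    while pending:
        progressed = False
        for node in sorted(nodes, key=free_capacity, reverse=True):
            res = next((r for r in pending if fits(node, r)), None)
            if res is None:
                continue  # nothing fits here, try the next-best node
            for kind, need in res["needs"].items():
                node["free"][kind] -= need
            placement[res["name"]] = node["name"]
            pending.remove(res)
            progressed = True
            break  # node weights changed, so re-sort before continuing
        if not progressed:
            break  # the remaining resources fit nowhere
    return placement
```

With two nodes offering A=10/B=10 and A=8/B=8 and three resources that each
need A=4/B=4, this toy model alternates between the nodes rather than filling
the first one completely.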

Hi!

That's interesting: I thought Pacemaker picks a resource to run first, and
then a node to run it on, but it seems to be the other way round: first pick a
node, then a resource.
However, when looking at the output of "crm_simulate -LUs", I see node scores
per resource, that is, many scores per node instead of one.
There is also a phrase I don't understand: "The resource that has the highest
score on the node where it's running gets allocated first..." Why does a
resource that is already running have to be allocated?

Also it seems the output of crm_simulate does not present the absolute
numbers, but a computation. For example let's look at the DLM clone here:
pcmk__clone_allocate: cln_DLM allocation score on h16: 4000
pcmk__clone_allocate: cln_DLM allocation score on h18: 4000
pcmk__clone_allocate: cln_DLM allocation score on h19: 8000

# OK, for some reason h19 is preferred significantly (by 4000)

pcmk__clone_allocate: prm_DLM:0 allocation score on h16: 1
pcmk__clone_allocate: prm_DLM:0 allocation score on h18: 0
pcmk__clone_allocate: prm_DLM:0 allocation score on h19: 0

# The first instance prefers h16 however. Why not h19, BTW?

pcmk__clone_allocate: prm_DLM:1 allocation score on h16: 0
pcmk__clone_allocate: prm_DLM:1 allocation score on h18: 0
pcmk__clone_allocate: prm_DLM:1 allocation score on h19: 1

# the second instance prefers h19

pcmk__clone_allocate: prm_DLM:2 allocation score on h16: 0
pcmk__clone_allocate: prm_DLM:2 allocation score on h18: 1
pcmk__clone_allocate: prm_DLM:2 allocation score on h19: 0

# so the third instance goes to h18

pcmk__native_allocate: prm_DLM:1 allocation score on h16: 0
pcmk__native_allocate: prm_DLM:1 allocation score on h18: 0
pcmk__native_allocate: prm_DLM:1 allocation score on h19: 1
native_assign_node: prm_DLM:1 utilization on h19:

# so the second instance goes to h19 (see above)

pcmk__native_allocate: prm_DLM:0 allocation score on h16: 1
pcmk__native_allocate: prm_DLM:0 allocation score on h18: 0
pcmk__native_allocate: prm_DLM:0 allocation score on h19: -INFINITY
native_assign_node: prm_DLM:0 utilization on h16:

# the first instance goes to h16, and h19 gets -INF as there is already an instance

pcmk__native_allocate: prm_DLM:2 allocation score on h16: -INFINITY
pcmk__native_allocate: prm_DLM:2 allocation score on h18: 1
pcmk__native_allocate: prm_DLM:2 allocation score on h19: -INFINITY
native_assign_node: prm_DLM:2 utilization on h18:

# the third instance goes to h18 as the other two have -INF

# What I wanted to say: Why don't the other nodes have a score of -INF right
from the beginning?
(Apart from that, the whole description of crm_simulate is very short, namely:
"crm_simulate - simulate a Pacemaker cluster's response to events".)
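
The score lines quoted above have a regular shape, so they can be tabulated
with a few lines of Python to make the comparison per resource easier. This is
only a sketch that assumes the exact line format shown in this mail;
parse_scores() and preferred_node() are made-up helper names, and INFINITY is
mapped to 1000000, the value Pacemaker documents for it.

```python
import re
from collections import defaultdict

INFINITY = 1_000_000  # numeric value Pacemaker documents for INFINITY

# Matches e.g. "pcmk__native_allocate: prm_DLM:0 allocation score on h19: -INFINITY"
SCORE_LINE = re.compile(
    r"^(?P<func>\S+): (?P<rsc>\S+) allocation score on (?P<node>\S+): "
    r"(?P<score>[+-]?(?:INFINITY|\d+))$"
)


def parse_scores(text):
    """Return {function: {resource: {node: score}}} for all score lines.
    Lines that do not match (comments, utilization lines) are skipped."""
    table = defaultdict(lambda: defaultdict(dict))
    for line in text.splitlines():
        m = SCORE_LINE.match(line.strip())
        if not m:
            continue
        raw = m["score"]
        sign = -1 if raw.startswith("-") else 1
        body = raw.lstrip("+-")
        value = INFINITY if body == "INFINITY" else int(body)
        table[m["func"]][m["rsc"]][m["node"]] = sign * value
    return table


def preferred_node(scores):
    # Node with the highest score; the first one listed wins a tie.
    return max(scores, key=scores.get)
```

Feeding it the pcmk__native_allocate lines for prm_DLM:0 gives h16 as the
preferred node, matching the native_assign_node line in the output.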

I understand that the cluster cannot easily find the perfect allocation, but
if it could assign some quality factor (e.g. 0-100) that says how well a
placement fulfills the requirements, it could in principle evaluate other
allocations, each yielding a different quality factor. Then I could imagine
that the bigger the difference in the quality factor, the less stickiness you
need to trigger a different allocation.

Apart from that, the cluster could evaluate some random allocations (I think
that is called the Monte Carlo method), remembering the one with the highest
quality factor. That wouldn't be optimal, but depending on the number of
iterations it could be "rather good". As documented, this may be
computationally expensive, but some external tool could help determine that
while the cluster is doing its job.
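
To make the idea concrete, here is a minimal Python sketch of such an external
tool. Nothing in it exists in Pacemaker: quality() is a made-up 0-100 factor
(100 means all nodes are equally loaded, 0 means over capacity or maximally
uneven), and monte_carlo() simply tries random allocations and remembers the
best one.

```python
import random


def quality(assignment, needs, capacity):
    """Hypothetical quality factor in 0..100 for one allocation.
    assignment: {resource: node}; needs: {resource: {type: amount}};
    capacity: {node: {type: amount}}."""
    used = {node: dict.fromkeys(cap, 0) for node, cap in capacity.items()}
    for rsc, node in assignment.items():
        for kind, amount in needs[rsc].items():
            used[node][kind] = used[node].get(kind, 0) + amount
    fractions = []
    for node, cap in capacity.items():
        worst = 0.0  # load of the most heavily used capacity type
        for kind, amount in used[node].items():
            limit = cap.get(kind, 0)
            if amount > limit:
                return 0.0  # allocation exceeds a node capacity
            if limit:
                worst = max(worst, amount / limit)
        fractions.append(worst)
    # The smaller the spread between nodes, the better the balance.
    return 100.0 * (1.0 - (max(fractions) - min(fractions)))


def monte_carlo(needs, capacity, iterations=1000, seed=0):
    """Evaluate random allocations and keep the one with the best quality."""
    rng = random.Random(seed)
    nodes = sorted(capacity)
    best, best_q = None, -1.0
    for _ in range(iterations):
        candidate = {rsc: rng.choice(nodes) for rsc in needs}
        q = quality(candidate, needs, capacity)
        if q > best_q:
            best, best_q = candidate, q
    return best, best_q
```

The difference between the quality of the current allocation and the best one
found could then be compared against the stickiness to decide whether a move
is worth it.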

> 
>> (Obviously A and B are independent, so you can't say "How many A is
>> worth one B" or vice versa.)
>> 
>> Then, given some placement, assume that node capacity for A or B
>> increases on a node.
>> How large does the stickiness parameter for the resource have to be to
>> prevent resource migration?
> 
> That's tricky to deduce. The easiest approach is to copy a scheduler
> input from the cluster as it is, and experiment by modifying the
> stickiness and utilization values as desired and running crm_simulate.
> 
> Everything boils down to each resource having a score on each node.
> That final score is the cumulative result of all sorts of effects, e.g.
> stickiness, the resource's own location preferences (constraint
> scores), a fraction of the location preferences of resources that are
> directly or indirectly colocated with this resource, etc., so it's not
> easy to say "changing this one value by X will have this effect".
> 
>> Are there any tools to help understanding?
> 
> Mainly crm_simulate
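
The point about the final score being a sum of many effects can be illustrated
with a small Python sketch. It only models the documented infinity rules
(anything plus INFINITY is INFINITY, anything minus INFINITY is -INFINITY, and
INFINITY minus INFINITY is -INFINITY); the contribution names and numbers are
invented, and how colocation preferences are actually weighted is not modeled
at all.

```python
INFINITY = 1_000_000  # numeric value Pacemaker documents for INFINITY


def add_scores(a, b):
    """Add two scores using Pacemaker's documented infinity rules."""
    if a <= -INFINITY or b <= -INFINITY:
        return -INFINITY  # -INFINITY beats everything, even +INFINITY
    if a >= INFINITY or b >= INFINITY:
        return INFINITY
    # Finite sums are clamped to the [-INFINITY, INFINITY] range.
    return max(-INFINITY, min(INFINITY, a + b))


def final_score(contributions):
    """Fold all contributions for one resource on one node into one score.
    contributions: {label: score}; the labels are purely descriptive."""
    total = 0
    for value in contributions.values():
        total = add_scores(total, value)
    return total
```

For example, a stickiness of 100, a location preference of 50 and a colocation
share of -20 add up to 130, while a single -INFINITY contribution bans the
node regardless of everything else.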

Another note: Is it correct that the example given assumes that memcached is
a clone? If not, what is the zero in "memcached:0"? Maybe the manual page could
be extended a bit to explain this. Fixing the alignment would be a good idea, too:

EXAMPLES
       Pretend a recurring monitor action  found  memcached  stopped  on  node
       fred.example.com  and,  during recovery, that the memcached stop action
       failed:

              crm_simulate       -LS       --op-inject       memcached:0_moni-
              tor_20000@bart.example.com=7            --op-fail            mem-
              cached:0_stop_0@fred.example.com=1    --save-output    /tmp/mem-
              cached-test.xml

I'm not a roff-geek, but in my pages I use a combination of ".na" and ".ad b"
around the example input.

Regards,
Ulrich

> 
>> Note: For placement-strategy=utilization it's easier: As long as
>> there is sufficient capacity, place each resource on the node
>> that has the fewest resources.
>> 
>> Regards,
>> Ulrich
> ‑‑ 
> Ken Gaillot <kgaillot at redhat.com>