[ClusterLabs] Antw: Re: Antw: Q: native_color scores for clones

Wed Sep 5 07:32:22 UTC 2018

>>> Ken Gaillot <kgaillot at redhat.com> wrote on 04.09.2018 at 19:21 in message
<1536081690.4387.6.camel at redhat.com>:
> On Tue, 2018-09-04 at 11:22 +0200, Ulrich Windl wrote:
>> > > > In reply to my message of 30.08.2018 at 12:23 in message
>> > > > <5B87C5A0.A46 : 161 : 60728>:
>> > Hi!
>> > 
>> > After having found showscores.sh, I thought I could improve the
>> > performance by porting it to Perl, but it seems the slow part is
>> > actually calling Pacemaker's helper tools like crm_attribute,
>> > crm_failcount, etc.
>> 
>> Actually the performance gain was less than expected, until I added a
>> cache for calling external programs reading stickiness, fail count
>> and migration threshold. Here are the numbers:
>> 
>> showscores.sh (original):
>> real    0m46.181s
>> user    0m15.573s
>> sys     0m21.761s
>> 
>> showscores.pl (without cache):
>> real    0m46.053s
>> user    0m15.861s
>> sys     0m20.997s
>> 
>> showscores.pl (with cache):
>> real    0m25.998s
>> user    0m7.940s
>> sys     0m12.609s
>> 
>> This made me wonder whether it's possible to retrieve such attributes
>> in a more efficient way, which raises the question of how the
>> corresponding tools actually work (those attributes are obviously not
>> part of the CIB).
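
(Inline note on the cache mentioned above: a minimal sketch of the idea, written in Python rather than Perl, and not the code from showscores.pl. The helper name run_cached is made up for illustration, and the command used in the demo is only a stand-in for the real crm_attribute or crm_failcount command lines.)

```python
import subprocess
import sys
from functools import lru_cache


@lru_cache(maxsize=None)
def run_cached(*argv):
    """Run an external command once per distinct argument tuple.

    Later calls with the same arguments return the remembered stdout
    instead of forking the program again.
    """
    result = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
        check=False,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; in the real script
    # this tuple would be the attribute query for one resource and node.
    cmd = (sys.executable, "-c", "print('demo value')")
    print(run_cached(*cmd))         # forks a process
    print(run_cached(*cmd))         # answered from the cache
    print(run_cached.cache_info())
```

The cache only pays off because the same attribute is queried many times within one run; values may be stale if the cluster changes while the report is being produced.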
> 
> Actually they are ... the policy engine (aka scheduler in 2.0) has only
> the CIB to make decisions. The other daemons have some additional state
> that can affect their behavior, but the scheduling of actions relies
> solely on the CIB.
> 
> Stickiness and migration-threshold are in the resource configuration
> (or defaults); fail counts are in the transient node attributes in the
> status section (which can only be retrieved from the live CIB or
> attribute daemons, not the CIB on disk, which may be why you didn't see
> them).

Looking for the fail count, the closest match I got looked like this:
<lrm_rsc_op id="prm_LVM_VMD_last_failure_0" operation_key="prm_LVM_VMD_monitor_0" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10" transition-key="58:107:7:8a33be7f-1b68-45bf-9143-54414ff3b662" transition-magic="0:0;58:107:7:8a33be7f-1b68-45bf-9143-54414ff3b662" on_node="h05" call-id="143" rc-code="0" op-status="0" interval="0" last-run="1533809981" last-rc-change="1533809981" exec-time="89" queue-time="0" op-digest="22698b9dba36e2926819f13c77569222"/>

Do I have to count and filter such elements, or is there a more direct way to get the fail count?
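
Going by the description above (fail counts are transient node attributes in the status section), I would expect something like the following sketch to work on the output of "cibadmin -Q". It is only a sketch under assumptions: the attribute naming "fail-count-<resource>" is what I assume for this version (newer releases may, as far as I know, append an operation suffix), and the sample XML is invented for illustration, not taken from my cluster.

```python
import xml.etree.ElementTree as ET

# Invented sample of a live CIB status section (assumption, for illustration).
SAMPLE = """
<cib>
  <status>
    <node_state id="h05" uname="h05">
      <transient_attributes id="h05">
        <instance_attributes id="status-h05">
          <nvpair id="status-h05-fail-count-prm_LVM_VMD"
                  name="fail-count-prm_LVM_VMD" value="2"/>
          <nvpair id="status-h05-last-failure-prm_LVM_VMD"
                  name="last-failure-prm_LVM_VMD" value="1533809981"/>
        </instance_attributes>
      </transient_attributes>
    </node_state>
  </status>
</cib>
"""

PREFIX = "fail-count-"  # assumed attribute name prefix


def fail_counts(cib_xml):
    """Return {(resource, node): value} for all fail-count attributes.

    Values are kept as strings because a fail count may be "INFINITY".
    """
    root = ET.fromstring(cib_xml)
    counts = {}
    for node in root.findall("./status/node_state"):
        uname = node.get("uname")
        path = "./transient_attributes/instance_attributes/nvpair"
        for nvpair in node.findall(path):
            name = nvpair.get("name", "")
            if name.startswith(PREFIX):
                counts[(name[len(PREFIX):], uname)] = nvpair.get("value")
    return counts


if __name__ == "__main__":
    for (rsc, node), value in sorted(fail_counts(SAMPLE).items()):
        print(rsc, node, value)
```

That would replace one crm_failcount call per resource and node with a single CIB query, if the assumptions hold.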

> 
>> I can get the CIB via cibadmin and I can parse the XML if needed,
>> but how can I get these other attributes? Tracing crm_attribute, it
>> seems it reads some partially binary file from locations like
>> /dev/shm/qb-attrd-response-*.
> 
> The /dev/shm files are simply where IPC communication is buffered, so
> that's just the response from a cib query (the equivalent of cibadmin
> -Q).
> 
>> I would think that _all_ relevant attributes should be part of the
>> CIB...
> 
> Yep, they are :)

I'm still having problems, sorry.

> 
> Often a final value is calculated from the CIB configuration, rather
> than directly in it. For example, for stickiness, the actual value
> could be in the resource configuration, a resource template, or
> resource defaults, or (pre-2.0) the legacy cluster properties for
> default stickiness. The configuration "unpacking" code will choose the
> final value based on a hierarchy of preference.

I guess the actual algorithm is hidden somewhere. Could I do that with XPath queries and some accumulation of numbers (like using the max or min), or is it more complicated?
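
To make my question concrete, here is a minimal sketch of the lookup as I understand it from the paragraph above. The order of preference (resource, then template, then rsc_defaults, then the legacy default-resource-stickiness cluster property) is my reading of that paragraph, not something I verified in the scheduler code. The sketch deliberately ignores rules, multiple attribute sets and anything inherited from an enclosing group or clone; the function names and the sample configuration are invented.

```python
import xml.etree.ElementTree as ET

NAME = "resource-stickiness"
LEGACY = "default-resource-stickiness"  # pre-2.0 cluster property

SAMPLE = """
<cib><configuration>
  <crm_config><cluster_property_set id="cib-bootstrap-options"/></crm_config>
  <resources>
    <template id="tpl" class="ocf" provider="heartbeat" type="Dummy">
      <meta_attributes id="tpl-meta">
        <nvpair id="tpl-meta-st" name="resource-stickiness" value="50"/>
      </meta_attributes>
    </template>
    <primitive id="prm_A" class="ocf" provider="heartbeat" type="Dummy">
      <meta_attributes id="prm_A-meta">
        <nvpair id="prm_A-meta-st" name="resource-stickiness" value="100"/>
      </meta_attributes>
    </primitive>
    <primitive id="prm_B" template="tpl"/>
    <primitive id="prm_C" class="ocf" provider="heartbeat" type="Dummy"/>
  </resources>
  <rsc_defaults>
    <meta_attributes id="rsc-options">
      <nvpair id="rsc-options-st" name="resource-stickiness" value="1"/>
    </meta_attributes>
  </rsc_defaults>
</configuration></cib>
"""


def _meta(elem, name):
    """Value of a meta attribute set directly on elem, or None."""
    if elem is None:
        return None
    nvpair = elem.find("./meta_attributes/nvpair[@name='%s']" % name)
    return None if nvpair is None else nvpair.get("value")


def _by_id(root, tag, wanted):
    return next((e for e in root.iter(tag) if e.get("id") == wanted), None)


def effective_stickiness(root, rsc_id):
    """Return (value, source); value is None if nothing is configured."""
    rsc = _by_id(root, "primitive", rsc_id)
    if rsc is None:
        raise KeyError(rsc_id)
    candidates = [
        (_meta(rsc, NAME), "resource"),
        (_meta(_by_id(root, "template", rsc.get("template")), NAME), "template"),
        (_meta(root.find("./configuration/rsc_defaults"), NAME), "rsc_defaults"),
    ]
    legacy = root.find(
        "./configuration/crm_config/cluster_property_set/nvpair[@name='%s']" % LEGACY
    )
    candidates.append((None if legacy is None else legacy.get("value"),
                       "cluster property"))
    for value, source in candidates:
        if value is not None:
            return value, source
    return None, "unset"  # the "? (D)" case in my table


if __name__ == "__main__":
    cib = ET.fromstring(SAMPLE)
    for rsc_id in ("prm_A", "prm_B", "prm_C"):
        print(rsc_id, effective_stickiness(cib, rsc_id))
```

Note that this is first-match-wins rather than any accumulation of numbers; whether that matches the real unpacking code is exactly what I am asking.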

> 
>> The other thing I realized was that both "migration threshold" and
>> "stickiness" are both undefined for several resources (due to the
>> fact that the default values for those also aren't defined). I really
>> wonder: Why not (e.g.) specify a default stickiness as integer 0
>> instead of having a magic NULL value of any type?
> 
> I'm guessing you mean the output of the attribute query? That's a good
> question.

Not just the output: The cluster also needs an algorithm to handle that, and I guess just having 0 as default would simplify that algorithm.

> 
> The current design allows you to distinguish between "the user
> explicitly specified a value of 0" and "the user did not specify a
> value".
> 
> The reason is mainly implementation. The attribute daemon and CIB
> daemon intentionally do not have any intelligence about the meaning of
> the data they store; they are just generic databases. Therefore, they
> do not know the default value for an unspecified attribute. Those are
> defined in the scheduler code when unpacking the configuration.

Yes, but a "default value" setting that has no default value of its own is somewhat strange. I also think SGML at least (XML maybe too) can specify a default value for an attribute, but given the way name-value pairs are modeled with XML in Pacemaker, that unfortunately won't help solve this problem.

> 
> It would be nice to have a tool option to calculate such values in the
> same way as the scheduler.
> 
>> And the questions that I really tried to answer (but failed to do so)
>> using this tool were:
>> Where can I see the tendency of the cluster to move (i.e.: balance)
>> resources?
>> What will happen when I lower the stickiness of a specific resource
>> by some amount?
> 
> The first question is not so easy, but the second one is: Save the live
> CIB to a file, make the change in the file, then run crm_simulate -Sx
> $FILENAME.

Yeah, found out: In crm shell you can edit the config, then run "simulate nograph actions" to see what would happen.
An answer to the first question would be interesting also.
In your configuration with priorities, utilization-based placement and different values of stickiness it's not always clear how the cluster decides to place resources. For example (AFAIK) utilisation-based placement does not build a tendency for placement; it just adds a -INFINITY location score if there are not enough utilisation units available on the corresponding node. So IMHO resource-balancing is only based on the number of resources already being started on a node (i.e. not trying to balance utilization values).
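
For the record, the file-based variant suggested above (save the live CIB, change it, run crm_simulate -Sx on the file) could be scripted roughly like this. It is a sketch only: the function names are mine, the generated nvpair ids are hypothetical, I have not checked the result against the CIB schema, and only the XML editing part can be tried without a cluster.

```python
import subprocess
import tempfile
import xml.etree.ElementTree as ET

RESOURCE_TAGS = ("primitive", "group", "clone", "master")


def set_stickiness(cib_xml, rsc_id, value):
    """Return a copy of the CIB XML with resource-stickiness set on rsc_id."""
    root = ET.fromstring(cib_xml)
    rsc = next((e for e in root.iter()
                if e.tag in RESOURCE_TAGS and e.get("id") == rsc_id), None)
    if rsc is None:
        raise KeyError(rsc_id)
    meta = rsc.find("meta_attributes")
    if meta is None:
        # Hypothetical id naming; placed first among the children.
        meta = ET.Element("meta_attributes", {"id": rsc_id + "-meta_attributes"})
        rsc.insert(0, meta)
    nvpair = meta.find("nvpair[@name='resource-stickiness']")
    if nvpair is None:
        nvpair = ET.SubElement(meta, "nvpair", {
            "id": meta.get("id") + "-resource-stickiness",
            "name": "resource-stickiness",
        })
    nvpair.set("value", str(value))
    return ET.tostring(root, encoding="unicode")


def simulate(cib_xml):
    """Write the CIB to a temporary file and run crm_simulate -S -x on it."""
    with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as fh:
        fh.write(cib_xml)  # the file is left behind for later inspection
    done = subprocess.run(["crm_simulate", "-S", "-x", fh.name],
                          stdout=subprocess.PIPE, universal_newlines=True,
                          check=True)
    return done.stdout


def what_if(rsc_id, value):
    """Query the live CIB, change one stickiness value, show the simulation."""
    live = subprocess.run(["cibadmin", "-Q"], stdout=subprocess.PIPE,
                          universal_newlines=True, check=True).stdout
    return simulate(set_stickiness(live, rsc_id, value))
```

On a cluster node, print(what_if("prm_s02_ctdb", 0)) would then show the actions the cluster would take, without touching the live configuration.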

> 
>> 
>> The current output of my tool looks a bit different (as there were
>> some bugs parsing the output of the tools in the initial version ;-), 
>> and I've implemented and used "sort by column", specifically
>> Resource, then node, ...):
>> 
>> Resource               Node     Score Stickin. Fail Count Migr. Thr.
>> ---------------------- ---- --------- -------- ---------- ----------
>> prm_DLM:0              h02  -INFINITY        1          0      ? (D)
>> prm_DLM:0              h06          1        1          0      ? (D)
>> prm_DLM:1              h02          1        1          0      ? (D)
>> prm_DLM:1              h06          0        1          0      ? (D)
>> prm_O2CB:0             h02  -INFINITY        1          0      ? (D)
>> prm_O2CB:0             h06          1        1          0      ? (D)
>> prm_O2CB:1             h02          1        1          0      ? (D)
>> prm_O2CB:1             h06  -INFINITY        1          0      ? (D)
>> prm_cfs_locks:0        h02  -INFINITY        1          0      ? (D)
>> prm_cfs_locks:0        h06          1        1          0      ? (D)
>> prm_cfs_locks:1        h02          1        1          0      ? (D)
>> prm_cfs_locks:1        h06  -INFINITY        1          0      ? (D)
>> prm_s02_ctdb:0         h02  -INFINITY        1          0      ? (D)
>> prm_s02_ctdb:0         h06          1        1          0      ? (D)
>> prm_s02_ctdb:1         h02          1        1          0      ? (D)
>> prm_s02_ctdb:1         h06  -INFINITY        1          0      ? (D)
>> 
>> In the table above "?" denotes an undefined value, and suffix " (D)"
>> indicates that the default value is being used, so "? (D)" actually
>> means, the resource had no value set, and the default value also
>> wasn't set, so there is no actual value (see above for discussion of
>> this "feature").
>> 
>> Another interesting point from the previous answers is this:
>> How is clone-max=2 or clone-node-max=1 or  master-node-max=1 or
>> master-max=1 actually implemented? Magic scores, hidden location
>> constraints, or what?
> 
> 30,158 lines of C code :-)
> 
> That's for the entire scheduler; many pieces are reused so it's hard to
> say one feature is implemented here or there.
> 
> Most things boil down to scores, but there is plenty of logic as well.

I see: It's quite easy to debug a single scenario ;-)

> 
> As an example for clone-max, when a clone is unpacked from the
> configuration, a parent resource is created for the clone itself (just
> a logical entity, it does not have any real actions of its own), and
> then a child resource is created for each clone instance (the actual
> service being cloned, which all real actions operate on). So, we just
> create clone-max children, then allocate those children to nodes.

That sheds some light on the magic of clones! And I guess if clone-min cannot be satisfied, the parent gets a -INFINITY score to run nowhere...
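
Just to check my understanding of "create clone-max children, then allocate those children to nodes", here is a toy model. It is explicitly not Pacemaker's algorithm: the function name, the greedy strategy, the tie-breaking by node order and the handling of clone-node-max are all my own assumptions for illustration.

```python
NEG_INF = float("-inf")


def allocate_clone(rsc_id, nodes, clone_max, clone_node_max=1,
                   score=lambda instance, node: 0):
    """Toy model: create clone_max children and place them one by one.

    Each child goes to the highest-scoring node that still has room;
    ties are broken by the order of the nodes list. A child that finds
    no usable node stays unallocated (None).
    """
    placed = {node: 0 for node in nodes}
    allocation = {}
    for index in range(clone_max):
        instance = "%s:%d" % (rsc_id, index)
        candidates = [
            (score(instance, node), node)
            for node in nodes
            if placed[node] < clone_node_max
        ]
        # A score of -INFINITY means "must not run here".
        candidates = [c for c in candidates if c[0] > NEG_INF]
        if not candidates:
            allocation[instance] = None
            continue
        _, best = max(candidates, key=lambda c: c[0])
        placed[best] += 1
        allocation[instance] = best
    return allocation


if __name__ == "__main__":
    print(allocate_clone("prm_DLM", ["h02", "h06"], clone_max=2))
```

In this toy model a node that already holds clone-node-max instances is simply excluded for the remaining children, which is the kind of effect I would expect to show up as a -INFINITY score, but that is a guess on my part.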

> 
>> I tried to locate good documentation for that, but failed to find
>> such.
>> (In my personal opinion, once you try to document things, you'll find
>> bugs, bad concepts, etc.)
>> Maybe start documenting things better, to make the product better,
>> too.
> 
> The documentation to-do list may not be as long as the code to-do, but
> it's still pretty impressive ;-)
> 
> It's mostly a matter of time. However there is also a limit as to how
> far internal algorithms can be documented; at some point, you're better
> off tracing through the code.

I know: You can produce more code if you do not write documentation; usually the time needed is about 1:1.

Thanks for your insights!

Regards,
Ulrich

> 
>> 
>> Regards,
>> Ulrich
>> 
>> > 
>> > But anyway: Being quite confident in what my program produces (;-)),
>> > I found some odd score values for clones that run in a two-node
>> > cluster. For example:
>> > 
>> > Resource               Node     Score Stickin. Fail Count Migr. Thr.
>> > ---------------------- ---- --------- -------- ---------- ----------
>> > prm_DLM:1              h02          1        0          0          0
>> > prm_DLM:1              h06          0        0          0          0
>> > prm_DLM:0              h02  -INFINITY        0          0          0
>> > prm_DLM:0              h06          1        0          0          0
>> > prm_O2CB:1             h02          1        0          0          0
>> > prm_O2CB:1             h06  -INFINITY        0          0          0
>> > prm_O2CB:0             h02  -INFINITY        0          0          0
>> > prm_O2CB:0             h06          1        0          0          0
>> > prm_cfs_locks:0        h02  -INFINITY        0          0          0
>> > prm_cfs_locks:0        h06          1        0          0          0
>> > prm_cfs_locks:1        h02          1        0          0          0
>> > prm_cfs_locks:1        h06  -INFINITY        0          0          0
>> > prm_s02_ctdb:0         h02  -INFINITY        0          0          0
>> > prm_s02_ctdb:0         h06          1        0          0          0
>> > prm_s02_ctdb:1         h02          1        0          0          0
>> > prm_s02_ctdb:1         h06  -INFINITY        0          0          0
>> > 
>> > For prm_DLM:1, for example, one node has score 0 and the other node
>> > has score 1, but for prm_DLM:0 the host that has 1 for prm_DLM:1 has
>> > -INFINITY (not 0), while the other host has the usual 1. So I guess
>> > that even without -INFINITY the configuration would be stable. For
>> > prm_O2CB two of the entries have -INFINITY as score. For
>> > prm_cfs_locks the pattern is the same, and for prm_s02_ctdb two of
>> > the entries have -INFINITY again.
>> > 
>> > I don't understand where those -INFINITY scores come from.
>> > Pacemaker is SLES11 SP4 (1.1.12-f47ea56).
>> > 
>> > It might also be a bug, because when I look at a three-node
>> > cluster, I see that a ":0" resource has score 1 once and 0 twice,
>> > but the corresponding ":2" resource has scores 0, 1, and -INFINITY,
>> > and the ":1" resource has score 1 once and -INFINITY twice.
>> > 
>> > When I look at the "clone_color" scores, the prm_DLM:* primitives
>> > look as expected (no -INFINITY). However, the cln_DLM clones have
>> > scores like 10000, 8200 and 2200 (depending on the node).
>> > 
>> > Can someone explain, please?
>> > 
>> > Regards,
>> > Ulrich
> -- 
> Ken Gaillot <kgaillot at redhat.com>