[ClusterLabs] Antw: Re: Antw: Q: native_color scores for clones

Ken Gaillot kgaillot at redhat.com
Wed Sep 5 14:13:23 UTC 2018


On Wed, 2018-09-05 at 09:32 +0200, Ulrich Windl wrote:
> > > > Ken Gaillot <kgaillot at redhat.com> wrote on 04.09.2018 at 19:21
> > > > in message <1536081690.4387.6.camel at redhat.com>:
> > On Tue, 2018-09-04 at 11:22 +0200, Ulrich Windl wrote:
> > > > > > In reply to my message of 30.08.2018 12:23 in message
> > > > > > <5B87C5A0.A46:161:60728>:
> > > > Hi!
> > > > 
> > > > After having found showscores.sh, I thought I could improve the
> > > > performance by porting it to Perl, but it seems the slow part is
> > > > actually calling Pacemaker's helper scripts like crm_attribute,
> > > > crm_failcount, etc...
> > > 
> > > Actually the performance gain was less than expected, until I added
> > > a cache for the calls to the external programs that read stickiness,
> > > fail count and migration threshold. Here are the numbers:
> > > 
> > > showscores.sh (original):
> > > real    0m46.181s
> > > user    0m15.573s
> > > sys     0m21.761s
> > > 
> > > showscores.pl (without cache):
> > > real    0m46.053s
> > > user    0m15.861s
> > > sys     0m20.997s
> > > 
> > > showscores.pl (with cache):
> > > real    0m25.998s
> > > user    0m7.940s
> > > sys     0m12.609s
> > > 
> > > This made me wonder whether it's possible to retrieve such
> > > attributes in a more efficient way, which raises the question of
> > > how the corresponding tools actually work (those attributes are
> > > obviously not part of the CIB).
> > 
> > Actually they are ... the policy engine (aka scheduler in 2.0) has
> > only the CIB to make decisions. The other daemons have some
> > additional state that can affect their behavior, but the scheduling
> > of actions relies solely on the CIB.
> > 
> > Stickiness and migration-threshold are in the resource configuration
> > (or defaults); fail counts are in the transient node attributes in
> > the status section (which can only be retrieved from the live CIB or
> > attribute daemons, not the CIB on disk, which may be why you didn't
> > see them).
> 
> Looking for the fail count, the closest match I got looked like this:
> <lrm_rsc_op id="prm_LVM_VMD_last_failure_0"
>   operation_key="prm_LVM_VMD_monitor_0" operation="monitor"
>   crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10"
>   transition-key="58:107:7:8a33be7f-1b68-45bf-9143-54414ff3b662"
>   transition-magic="0:0;58:107:7:8a33be7f-1b68-45bf-9143-54414ff3b662"
>   on_node="h05" call-id="143" rc-code="0" op-status="0" interval="0"
>   last-run="1533809981" last-rc-change="1533809981" exec-time="89"
>   queue-time="0" op-digest="22698b9dba36e2926819f13c77569222"/>

That's the most recently failed instance of the operation (for that
node and resource). It's used to show the "failed actions" section of
the crm_mon display.
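
If you just want to eyeball those entries, they can also be pulled out
of a live CIB query. A minimal sketch (assuming cibadmin and xmllint
are available; the "_last_failure_" id pattern is the one visible in
the element above):

    # List the most recent failed-operation entries recorded in the CIB
    cibadmin --query |
        xmllint --xpath '//lrm_rsc_op[contains(@id, "_last_failure_")]' -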

> Do I have to count and filter such elements, or is there a more
> direct way to get the fail count?

The simplest way to get the fail count is with the crm_failcount tool.
That's especially true since per-operation fail counts were added in
1.1.17.

In the CIB XML, fail counts are stored within <status> <node_state>
<transient_attributes>. The attribute name will start with "fail-count-
".

They can also be queried via attrd_updater, if you know the attribute
name.
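
For example (just a sketch; the resource and node names are
placeholders taken from your table, and the exact attribute name and
tool options depend on the Pacemaker version and whether per-operation
fail counts are in use):

    # Ask the cluster for a resource's fail count on one node
    # (recent crm_failcount syntax)
    crm_failcount --query -r prm_DLM -N h02

    # Or read the raw transient attributes from the live CIB status
    # section (assumes xmllint from libxml2 is available)
    cibadmin -Q -o status |
        xmllint --xpath '//transient_attributes//nvpair[starts-with(@name, "fail-count")]' -

    # Or query the attribute daemon directly, if you know the name
    attrd_updater -Q -n fail-count-prm_DLM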

> > > I can get the CIB via cibadmin and I can parse the XML if needed,
> > > but how can I get these other attributes? Tracing crm_attribute, it
> > > seems to read some partially binary files from locations like
> > > /dev/shm/qb-attrd-response-*.
> > 
> > The /dev/shm files are simply where IPC communication is buffered,
> > so that's just the response from a CIB query (the equivalent of
> > cibadmin -Q).
> > 
> > > I would think that _all_ relevant attributes should be part of
> > > the
> > > CIB...
> > 
> > Yep, they are :)
> 
> I'm still having problems, sorry.
> 
> > 
> > Often a final value is calculated from the CIB configuration, rather
> > than stored directly in it. For example, for stickiness, the actual
> > value could be in the resource configuration, a resource template,
> > or resource defaults, or (pre-2.0) the legacy cluster properties for
> > default stickiness. The configuration "unpacking" code will choose
> > the final value based on a hierarchy of preference.
> 
> I guess the actual algorithm is hidden somewhere. Could I do that
> with XPath queries and some accumulation of numbers (like using the
> max or min), or is it more complicated?

Most of the default values are directly in the code when unpacking the
configuration. Much of it is in lib/pengine/unpack.c, but parts are
scattered elsewhere in lib/pengine (and it's not an easy read).

Because of how the memory is allocated, anything that the code does not
explicitly default to something else ends up defaulting to 0.
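
If you want to chase down a particular default, grepping the unpacking
code for the relevant meta-attribute constants is usually the fastest
route; a rough example against a 1.1-era source tree (the XML_RSC_ATTR_*
names map to the actual attribute strings in include/crm/msg_xml.h):

    # Show where resource meta-attributes are read during unpacking
    grep -rn "XML_RSC_ATTR_" lib/pengine/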

> > > The other thing I realized was that "migration threshold" and
> > > "stickiness" are both undefined for several resources (due to the
> > > fact that the default values for those also aren't defined). I
> > > really wonder: why not (e.g.) specify a default stickiness as
> > > integer 0 instead of having a magic NULL value of any type?
> > 
> > I'm guessing you mean the output of the attribute query? That's a
> > good
> > question.
> 
> Not just the output: the cluster also needs an algorithm to handle
> that, and I guess just having 0 as the default would simplify that
> algorithm.

It does, but the algorithm is purely in the scheduler's unpacking code,
so the values generally can't be queried. The CIB daemon and attribute
daemon don't have that code, so they can't answer queries about it.

In the case of stickiness, lib/pengine/complex.c has this code:

    (*rsc)->stickiness = 0;
    ...
    value = g_hash_table_lookup((*rsc)->meta, XML_RSC_ATTR_STICKINESS);
    if (value != NULL && safe_str_neq("default", value)) {
        (*rsc)->stickiness = char2score(value);
    }

which defaults the stickiness to 0, then uses the integer value of
"resource-stickiness" from meta-attributes (as long as it's not the
literal string "default"). This is after meta-attributes have been
unpacked, which takes care of the precedence of the resource's own
meta-attributes > rsc_defaults > legacy properties.
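
If you want something similar in a script, you more or less have to
re-implement that lookup. A minimal sketch of the same idea for a
single resource (ignoring templates, rules, clone parents and the
legacy property, and assuming cibadmin and xmllint are available;
"prm_DLM" is just the example resource from your table):

    rsc="prm_DLM"
    cib=$(cibadmin -Q)

    # Print the resource-stickiness value found under the given XPath,
    # or nothing if it is not set there
    lookup() {
        echo "$cib" | xmllint --xpath \
            "string($1//nvpair[@name='resource-stickiness']/@value)" - 2>/dev/null
    }

    value=$(lookup "//primitive[@id='$rsc']")        # resource meta-attribute
    [ -n "$value" ] || value=$(lookup "//rsc_defaults")   # then rsc_defaults
    [ -n "$value" ] || value=0                            # scheduler default
    echo "$rsc effective stickiness: $value"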

> > The current design allows you to distinguish between "the user
> > explicitly specified a value of 0" and "the user did not specify a
> > value".
> > 
> > The reason is mainly implementation. The attribute daemon and CIB
> > daemon intentionally do not have any intelligence about the meaning
> > of the data they store; they are just generic databases. Therefore,
> > they do not know the default value for an unspecified attribute.
> > Those are defined in the scheduler code when unpacking the
> > configuration.
> 
> Yes, but having a variable "default value" that does not itself have a
> default value is somewhat strange. I also think SGML at least (XML
> maybe too) can specify a default value for an attribute, but given the
> way name-value pairs are modeled in Pacemaker's XML, that won't help
> solve this problem, unfortunately.
> 
> > 
> > It would be nice to have a tool option to calculate such values in
> > the
> > same way as the scheduler.
> > 
> > > And the questions that I really tried to answer (but failed to do
> > > so) using this tool were:
> > > Where can I see the tendency of the cluster to move (i.e.:
> > > balance) resources?
> > > What will happen when I lower the stickiness of a specific resource
> > > by some amount?
> > 
> > The first question is not so easy, but the second one is: Save the
> > live CIB to a file, make the change in the file, then run
> > crm_simulate -Sx $FILENAME.
> 
> Yeah, found out: in crm shell you can edit the config, then run
> "simulate nograph actions" to see what would happen.
> An answer to the first question would also be interesting.
> In a configuration with priorities, utilization-based placement and
> different values of stickiness, it's not always clear how the cluster
> decides to place resources. For example (AFAIK) utilization-based
> placement does not build a tendency for placement; it just adds a
> -INFINITY location score if there are not enough utilization units
> available on the corresponding node. So IMHO resource balancing is
> only based on the number of resources already started on a node
> (i.e. not trying to balance utilization values).
> 
> > 
> > > 
> > > The current output of my tool looks a bit different (as there were
> > > some bugs parsing the output of the tools in the initial version
> > > ;-), and I've implemented and used "sort by column" (specifically
> > > Resource, then Node, ...):
> > > 
> > > Resource               Node     Score Stickin. Fail Count Migr. Thr.
> > > ---------------------- ---- --------- -------- ---------- ----------
> > > prm_DLM:0              h02  -INFINITY        1          0      ? (D)
> > > prm_DLM:0              h06          1        1          0      ? (D)
> > > prm_DLM:1              h02          1        1          0      ? (D)
> > > prm_DLM:1              h06          0        1          0      ? (D)
> > > prm_O2CB:0             h02  -INFINITY        1          0      ? (D)
> > > prm_O2CB:0             h06          1        1          0      ? (D)
> > > prm_O2CB:1             h02          1        1          0      ? (D)
> > > prm_O2CB:1             h06  -INFINITY        1          0      ? (D)
> > > prm_cfs_locks:0        h02  -INFINITY        1          0      ? (D)
> > > prm_cfs_locks:0        h06          1        1          0      ? (D)
> > > prm_cfs_locks:1        h02          1        1          0      ? (D)
> > > prm_cfs_locks:1        h06  -INFINITY        1          0      ? (D)
> > > prm_s02_ctdb:0         h02  -INFINITY        1          0      ? (D)
> > > prm_s02_ctdb:0         h06          1        1          0      ? (D)
> > > prm_s02_ctdb:1         h02          1        1          0      ? (D)
> > > prm_s02_ctdb:1         h06  -INFINITY        1          0      ? (D)
> > > 
> > > In the table above, "?" denotes an undefined value and the suffix
> > > " (D)" indicates that the default value is being used, so "? (D)"
> > > actually means the resource had no value set and the default value
> > > also wasn't set, so there is no actual value (see above for the
> > > discussion of this "feature").
> > > 
> > > Another interesting point from the previous answers is this:
> > > How is clone-max=2 or clone-node-max=1 or  master-node-max=1 or
> > > master-max=1 actually implemented? Magic scores, hidden location
> > > constraints, or what?
> > 
> > 30,158 lines of C code :-)
> > 
> > That's for the entire scheduler; many pieces are reused so it's hard
> > to say one feature is implemented here or there.
> > 
> > Most things boil down to scores, but there is plenty of logic as
> > well.
> 
> I see: It's quite easy to debug a single scenario ;-)
> 
> > 
> > As an example for clone-max, when a clone is unpacked from the
> > configuration, a parent resource is created for the clone itself
> > (just a logical entity, it does not have any real actions of its
> > own), and then a child resource is created for each clone instance
> > (the actual service being cloned, which all real actions operate
> > on). So, we just create clone-max children, then allocate those
> > children to nodes.
> 
> That sheds some light on the magic of clones! And I guess if clone-min
> cannot be satisfied, the parent gets a -INFINITY score so it runs
> nowhere...
> 
> > 
> > > I tried to locate good documentation for that, but failed to find
> > > any.
> > > (In my personal opinion, once you try to document things, you'll
> > > find bugs, bad concepts, etc.)
> > > Maybe start documenting things better, to make the product better,
> > > too.
> > 
> > The documentation to-do list may not be as long as the code to-do
> > list, but it's still pretty impressive ;-)
> > 
> > It's mostly a matter of time. However, there is also a limit to how
> > far internal algorithms can be documented; at some point, you're
> > better off tracing through the code.
> 
> I know: You can produce more code if you do not write documentation;
> usually the time needed is about 1:1.
> 
> Thanks for your insights!
> 
> Regards,
> Ulrich
> 
> > 
> > > 
> > > Regards,
> > > Ulrich
> > > 
> > > > 
> > > > But anyway: being quite confident about what my program produces
> > > > (;-)), I found some odd score values for clones that run in a
> > > > two-node cluster. For example:
> > > > 
> > > > Resource               Node     Score Stickin. Fail Count Migr. Thr.
> > > > ---------------------- ---- --------- -------- ---------- ----------
> > > > prm_DLM:1              h02          1        0          0          0
> > > > prm_DLM:1              h06          0        0          0          0
> > > > prm_DLM:0              h02  -INFINITY        0          0          0
> > > > prm_DLM:0              h06          1        0          0          0
> > > > prm_O2CB:1             h02          1        0          0          0
> > > > prm_O2CB:1             h06  -INFINITY        0          0          0
> > > > prm_O2CB:0             h02  -INFINITY        0          0          0
> > > > prm_O2CB:0             h06          1        0          0          0
> > > > prm_cfs_locks:0        h02  -INFINITY        0          0          0
> > > > prm_cfs_locks:0        h06          1        0          0          0
> > > > prm_cfs_locks:1        h02          1        0          0          0
> > > > prm_cfs_locks:1        h06  -INFINITY        0          0          0
> > > > prm_s02_ctdb:0         h02  -INFINITY        0          0          0
> > > > prm_s02_ctdb:0         h06          1        0          0          0
> > > > prm_s02_ctdb:1         h02          1        0          0          0
> > > > prm_s02_ctdb:1         h06  -INFINITY        0          0          0
> > > > 
> > > > For prm_DLM:1, for example, one node has score 0 and the other
> > > > node has score 1, but for prm_DLM:0 the host that has 1 for
> > > > prm_DLM:1 has -INFINITY (not 0), while the other host has the
> > > > usual 1. So I guess that even without -INFINITY the configuration
> > > > would be stable. For prm_O2CB two nodes have -INFINITY as score.
> > > > For prm_cfs_locks the pattern is as usual, and for prm_s02_ctdb
> > > > two nodes have -INFINITY again.
> > > > 
> > > > I don't understand where those -INFINITY scores come from.
> > > > Pacemaker is the SLES11 SP4 version (1.1.12-f47ea56).
> > > > 
> > > > It might also be a bug, because when I look at a three-node
> > > > cluster, I see that a ":0" resource has score 1 once and 0 twice,
> > > > but the corresponding ":2" resource has scores 0, 1, and
> > > > -INFINITY, and the ":1" resource has score 1 once and -INFINITY
> > > > twice.
> > > > 
> > > > When I look at the "clone_color" scores, the prm_DLM:* primitives
> > > > look as expected (no -INFINITY). However, the cln_DLM clones have
> > > > scores like 10000, 8200 and 2200 (depending on the node).
> > > > 
> > > > Can someone explain, please?
> > > > 
> > > > Regards,
> > > > Ulrich
-- 
Ken Gaillot <kgaillot at redhat.com>

