[ClusterLabs] Antw: Q: native_color scores for clones

Ken Gaillot kgaillot at redhat.com
Tue Sep 4 17:21:30 UTC 2018


On Tue, 2018-09-04 at 11:22 +0200, Ulrich Windl wrote:
> > > > In reply to my message of 30.08.2018 at 12:23 in message
> > > > <5B87C5A0.A46 : 161 : 60728>:
> > Hi!
> > 
> > After having found showscores.sh, I thought I could improve the
> > performance by porting it to Perl, but it seems the slow part is
> > actually calling Pacemaker's helper scripts like crm_attribute,
> > crm_failcount, etc...
> 
> Actually the performance gain was less than expected, until I added a
> cache for the external program calls that read stickiness, fail count
> and migration threshold. Here are the numbers:
> 
> showscores.sh (original):
> real    0m46.181s
> user    0m15.573s
> sys     0m21.761s
> 
> showscores.pl (without cache):
> real    0m46.053s
> user    0m15.861s
> sys     0m20.997s
> 
> showscores.pl (with cache):
> real    0m25.998s
> user    0m7.940s
> sys     0m12.609s
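The caching idea described above can be sketched like this (an illustrative Python sketch; the real script is Perl, and `query_attribute` is a hypothetical helper name, not one of the Pacemaker tools):

```python
import subprocess
from functools import lru_cache

# Memoize external helper calls: repeated queries for the same
# attribute hit the cache instead of forking crm_attribute (or a
# similar tool) again, which is where most of the wall time goes.
@lru_cache(maxsize=None)
def query_attribute(command_line):
    """Run an external helper once per distinct command line."""
    result = subprocess.run(command_line.split(), capture_output=True,
                            text=True)
    return result.stdout.strip()
```

With many resources and nodes sharing the same stickiness and threshold settings, most lookups after the first become cache hits, which matches the roughly halved runtime reported above.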
> 
> This made me think about whether it's possible to retrieve such
> attributes in a more efficient way, which raises the question of how
> the corresponding tools actually work (those attributes are obviously
> not part of the CIB).

Actually they are ... the policy engine (aka scheduler in 2.0) has only
the CIB to make decisions. The other daemons have some additional state
that can affect their behavior, but the scheduling of actions relies
solely on the CIB.

Stickiness and migration-threshold are in the resource configuration
(or defaults); fail counts are in the transient node attributes in the
status section (which can only be retrieved from the live CIB or
attribute daemons, not the CIB on disk, which may be why you didn't see
them).
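Both kinds of values can be pulled straight out of the CIB XML. A minimal sketch, using a hand-written CIB fragment with an assumed (simplified) layout rather than a real cluster dump:

```python
import xml.etree.ElementTree as ET

# Simplified CIB fragment: stickiness lives in the resource's
# meta_attributes under the configuration section, while fail counts
# appear as transient node attributes in the status section (only
# present in the *live* CIB, not the on-disk copy).
CIB = """
<cib>
  <configuration>
    <resources>
      <primitive id="prm_DLM">
        <meta_attributes id="prm_DLM-meta">
          <nvpair name="resource-stickiness" value="1"/>
        </meta_attributes>
      </primitive>
    </resources>
  </configuration>
  <status>
    <node_state uname="h02">
      <transient_attributes>
        <instance_attributes>
          <nvpair name="fail-count-prm_DLM" value="0"/>
        </instance_attributes>
      </transient_attributes>
    </node_state>
  </status>
</cib>
"""

root = ET.fromstring(CIB)

def nvpair(scope, name):
    """Find an nvpair value by name anywhere under the given element."""
    for nv in scope.iter("nvpair"):
        if nv.get("name") == name:
            return nv.get("value")
    return None

stickiness = nvpair(root.find("configuration"), "resource-stickiness")
fail_count = nvpair(root.find("status"), "fail-count-prm_DLM")
```

Parsing one `cibadmin -Q` dump this way replaces many per-attribute tool invocations with a single query.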

> I can get the CIB via cibadmin and I can parse the XML if needed,
> but how can I get these other attributes? Tracing crm_attribute, it
> seems it reads some partially binary file from locations like
> /dev/shm/qb-attrd-response-*.

The /dev/shm files are simply where IPC communication is buffered, so
that's just the response from a CIB query (the equivalent of cibadmin
-Q).

> I would think that _all_ relevant attributes should be part of the
> CIB...

Yep, they are :)

Often a final value is calculated from the CIB configuration rather
than stored directly in it. For example, the actual stickiness value
could be in the resource configuration, a resource template, or
resource defaults, or (pre-2.0) the legacy cluster properties for
default stickiness. The configuration "unpacking" code chooses the
final value based on a hierarchy of preference.
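That preference hierarchy can be sketched as follows (an illustrative Python sketch, not the actual C unpacking code; the function name and dict-based inputs are assumptions):

```python
# First match wins: resource meta-attributes, then a resource
# template, then rsc_defaults, then (pre-2.0) the legacy cluster
# property, and finally the scheduler's built-in default of 0.
def effective_stickiness(resource_meta, template_meta, rsc_defaults,
                         legacy_default=None):
    for source in (resource_meta, template_meta, rsc_defaults):
        value = source.get("resource-stickiness")
        if value is not None:
            return int(value)
    if legacy_default is not None:  # legacy default-resource-stickiness
        return int(legacy_default)
    return 0  # scheduler's built-in default
```

For example, a stickiness set on the resource itself wins over one set in rsc_defaults, and only a fully unset attribute falls through to the built-in default.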

> The other thing I realized was that both "migration threshold" and
> "stickiness" are both undefined for several resources (due to the
> fact that the default values for those also aren't defined). I really
> wonder: Why not (e.g.) specify a default stickiness as integer 0
> instead of having a magic NULL value of any type?

I'm guessing you mean the output of the attribute query? That's a good
question.

The current design allows you to distinguish between "the user
explicitly specified a value of 0" and "the user did not specify a
value".

The reason is mainly implementation. The attribute daemon and CIB
daemon intentionally do not have any intelligence about the meaning of
the data they store; they are just generic databases. Therefore, they
do not know the default value for an unspecified attribute. Those are
defined in the scheduler code when unpacking the configuration.

It would be nice to have a tool option to calculate such values in the
same way as the scheduler.
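The distinction between the two layers can be shown in a few lines (an illustrative sketch; the function names are made up for the example):

```python
# The stored value and the effective value are different things: the
# CIB/attrd layer just reports what is stored (possibly nothing), and
# only the scheduler layer applies a default.
def stored_value(attrs, name):
    return attrs.get(name)  # None means "not specified"

def scheduler_value(attrs, name, default):
    value = attrs.get(name)
    return default if value is None else value
```

An explicit 0 and an unset attribute look identical to the scheduler but remain distinguishable at the storage layer, which is exactly the behavior the attribute query exposes.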

> And the questions that I really tried to answer (but failed to do so)
> using this tool were:
> Where can I see the tendency of the cluster to move (i.e.: balance)
> resources?
> What will happen when I lower the stickiness of a specific resource
> by some amount?

The first question is not so easy, but the second one is: Save the live
CIB to a file, make the change in the file, then run crm_simulate -Sx
$FILENAME.
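The editing step of that what-if workflow could look like this (a sketch assuming the CIB was already saved with `cibadmin -Q > cib.xml`; `set_stickiness` is a hypothetical helper):

```python
import xml.etree.ElementTree as ET

# Lower the stickiness of one resource in a saved CIB copy; the edited
# file can then be fed to "crm_simulate -Sx cib.xml" to see what the
# scheduler would do, without touching the live cluster.
def set_stickiness(cib_xml, resource_id, new_value):
    root = ET.fromstring(cib_xml)
    for primitive in root.iter("primitive"):
        if primitive.get("id") == resource_id:
            for nv in primitive.iter("nvpair"):
                if nv.get("name") == "resource-stickiness":
                    nv.set("value", str(new_value))
    return ET.tostring(root, encoding="unicode")
```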

> 
> The current output of my tool looks a bit different (there were
> some bugs parsing the output of the tools in the initial version ;-),
> and I've implemented and used "sort by column" (specifically:
> resource, then node, ...):
> 
> Resource               Node     Score Stickin. Fail Count Migr. Thr.
> ---------------------- ---- --------- -------- ---------- ----------
> prm_DLM:0              h02  -INFINITY        1          0      ? (D)
> prm_DLM:0              h06          1        1          0      ? (D)
> prm_DLM:1              h02          1        1          0      ? (D)
> prm_DLM:1              h06          0        1          0      ? (D)
> prm_O2CB:0             h02  -INFINITY        1          0      ? (D)
> prm_O2CB:0             h06          1        1          0      ? (D)
> prm_O2CB:1             h02          1        1          0      ? (D)
> prm_O2CB:1             h06  -INFINITY        1          0      ? (D)
> prm_cfs_locks:0        h02  -INFINITY        1          0      ? (D)
> prm_cfs_locks:0        h06          1        1          0      ? (D)
> prm_cfs_locks:1        h02          1        1          0      ? (D)
> prm_cfs_locks:1        h06  -INFINITY        1          0      ? (D)
> prm_s02_ctdb:0         h02  -INFINITY        1          0      ? (D)
> prm_s02_ctdb:0         h06          1        1          0      ? (D)
> prm_s02_ctdb:1         h02          1        1          0      ? (D)
> prm_s02_ctdb:1         h06  -INFINITY        1          0      ? (D)
> 
> In the table above "?" denotes an undefined value, and the suffix
> " (D)" indicates that the default value is being used, so "? (D)"
> means the resource had no value set and the default value also
> wasn't set, so there is no actual value (see above for the
> discussion of this "feature").
> 
> Another interesting point from the previous answers is this:
> How are clone-max=2, clone-node-max=1, master-node-max=1 and
> master-max=1 actually implemented? Magic scores, hidden location
> constraints, or what?

30,158 lines of C code :-)

That's for the entire scheduler; many pieces are reused so it's hard to
say one feature is implemented here or there.

Most things boil down to scores, but there is plenty of logic as well.

As an example for clone-max, when a clone is unpacked from the
configuration, a parent resource is created for the clone itself (just
a logical entity, it does not have any real actions of its own), and
then a child resource is created for each clone instance (the actual
service being cloned, which all real actions operate on). So, we just
create clone-max children, then allocate those children to nodes.
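The unpacking step described above can be sketched as a toy model (illustrative Python only; the real scheduler's allocation is far more involved, and these function names are made up):

```python
# A clone unpacks into a logical parent plus clone-max child
# instances (e.g. prm_DLM:0, prm_DLM:1), which are then allocated to
# nodes like ordinary resources.
def unpack_clone(clone_id, child_id, clone_max):
    """Create the logical clone parent and clone-max child instances."""
    return {"id": clone_id,
            "children": [{"id": f"{child_id}:{i}"} for i in range(clone_max)]}

def allocate(clone, nodes, clone_node_max=1):
    """Naive placement: at most clone-node-max children per node."""
    placement = {}
    slots = {n: clone_node_max for n in nodes}
    for child in clone["children"]:
        for n in nodes:
            if slots[n] > 0:
                placement[child["id"]] = n
                slots[n] -= 1
                break
    return placement
```

With clone-max=2 and clone-node-max=1 on two nodes, this places one instance per node, which is the shape of the two-node tables shown earlier in the thread.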

> I tried to locate good documentation for that, but failed to find
> any.
> (In my personal opinion, once you try to document things, you'll find
> bugs, bad concepts, etc.)
> Maybe start documenting things better, to make the product better,
> too.

The documentation to-do list may not be as long as the code to-do
list, but it's still pretty impressive ;-)

It's mostly a matter of time. However, there is also a limit to how
far internal algorithms can be documented; at some point, you're
better off tracing through the code.

> 
> Regards,
> Ulrich
> 
> > 
> > But anyway: Being quite confident what my program produces (;-)), I
> > found 
> > some odd score values for clones that run in a two node cluster.
> > For example:
> > 
> > Resource               Node     Score Stickin. Fail Count Migr. Thr.
> > ---------------------- ---- --------- -------- ---------- ----------
> > prm_DLM:1              h02          1        0          0          0
> > prm_DLM:1              h06          0        0          0          0
> > prm_DLM:0              h02  -INFINITY        0          0          0
> > prm_DLM:0              h06          1        0          0          0
> > prm_O2CB:1             h02          1        0          0          0
> > prm_O2CB:1             h06  -INFINITY        0          0          0
> > prm_O2CB:0             h02  -INFINITY        0          0          0
> > prm_O2CB:0             h06          1        0          0          0
> > prm_cfs_locks:0        h02  -INFINITY        0          0          0
> > prm_cfs_locks:0        h06          1        0          0          0
> > prm_cfs_locks:1        h02          1        0          0          0
> > prm_cfs_locks:1        h06  -INFINITY        0          0          0
> > prm_s02_ctdb:0         h02  -INFINITY        0          0          0
> > prm_s02_ctdb:0         h06          1        0          0          0
> > prm_s02_ctdb:1         h02          1        0          0          0
> > prm_s02_ctdb:1         h06  -INFINITY        0          0          0
> > 
> > For prm_DLM:1, for example, one node has score 0 and the other
> > node has score 1, but for prm_DLM:0 the host that has 1 for
> > prm_DLM:1 has -INFINITY (not 0), while the other host has the
> > usual 1. So I guess that even without -INFINITY the configuration
> > would be stable. For prm_O2CB two nodes have -INFINITY as score.
> > For prm_cfs_locks the pattern is as usual, and for prm_s02_ctdb
> > two nodes have -INFINITY again.
> > 
> > I don't understand where those -INFINITY scores come from.
> > Pacemaker is 
> > SLES11 SP4 (1.1.12-f47ea56).
> > 
> > It might also be a bug, because when I look at a three-node
> > cluster, I see that a ":0" resource had score 1 once and 0 twice,
> > but the corresponding ":2" resource has scores 0, 1, and
> > -INFINITY, and the ":1" resource has score 1 once and -INFINITY
> > twice.
> > 
> > When I look at the "clone_color" scores, the prm_DLM:* primitives
> > look as expected (no -INFINITY). However, the cln_DLM clones have
> > scores like 10000, 8200 and 2200 (depending on the node).
> > 
> > Can someone explain, please?
> > 
> > Regards,
> > Ulrich
-- 
Ken Gaillot <kgaillot at redhat.com>

