[ClusterLabs Developers] [ClusterLabs] [pacemaker] Discretion with glib v2.59.0+ recommended

Tue Feb 12 16:31:30 UTC 2019

On Mon, 2019-02-11 at 17:01 -0600, Ken Gaillot wrote:
> On Mon, 2019-02-11 at 22:48 +0100, Jan Pokorný wrote:
> > On 20/01/19 12:44 +0100, Jan Pokorný wrote:
> > > On 18/01/19 20:32 +0100, Jan Pokorný wrote:
> > > > It was discovered that this release of the glib project slightly
> > > > changed some parameters of how values are distributed within hash
> > > > table structures, undermining pacemaker's hard (alas unfeasible)
> > > > attempt to turn this data type into a fully predictable entity.
> > > > 
> > > > Current impact is unknown besides some internal regression tests
> > > > failing due to this, so that, e.g., in the environment variables
> > > > passed in the notification messages, the order of the active nodes
> > > > (being a space-separated list) may appear shuffled in comparison
> > > > with the long-standing behaviour (perhaps making a false impression
> > > > of determinism) witnessed with older versions of glib in the game.
> > > 
> > > Our immediate response is to, at the very least, make the
> > > cts-scheduler regression suite (the only localhost one that was
> > > rendered broken with 52 tests out of 733 failed) skip those tests
> > > where reliance on the exact order of hash-table-driven items was
> > > sported, so it won't fail as a whole:
> > > 
> > > https://github.com/ClusterLabs/pacemaker/pull/1677/commits/15ace890ef0b987db035ee2d71994e37f7eaff96
> > > [above edit: updated with the newer version of the patch]
> > 
> > Shout-out to Ken for fixing the immediate fallout (deterministic
> > output breakages in some cts-scheduler tests, making the above
> > change superfluous) for the upcoming 2.0.1 release!
> > 
> > > > Variations like these are expected, and you may take it as an
> > > > opportunity to fix incorrect order-wise assumptions (like in the
> > > > stated case).
> > > 
> > > [intentionally CC'd developers@, should have done it since the
> > > beginning]
> > > 
> > > At this point, testing with glib v2.59.0+, preferably using
> > > 2.0.1-rc3 due to the release cycle timing, is VERY DESIRED if you
> > > are considering providing some volunteer capacity to the pacemaker
> > > project, especially if you have your own agents and scripts that
> > > rely on the exact (and previously likely stable) order of "set data
> > > made linear, hence artificially ordered", like with the
> > > OCF_RESKEY_CRM_meta_notify_active_uname environment variable in
> > > clone notifications (as was already suggested; the complete list is
> > > also unknown at this point, unfortunately, for a lack of systemic
> > > and precise tracking of data items in general).

Also, there is a bit of confusion here: The *value* of each environment
variable has always been a list (not hash table) in a guaranteed order.

What's affected is the environment variables themselves, i.e. the order
in which OCF_RESKEY_CRM_meta_notify_* appear in the graph action
meta-data. This is purely internal to pacemaker and has no effect on
resource agents, as there already is no concept of environment
variables being an ordered list.
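
To illustrate the general point (this is not pacemaker's actual code,
just a minimal Python sketch): a mapping has no inherent order, so any
output produced by iterating over one is only deterministic if the keys
are put into an explicit order first. The function name `render_meta`
and the sample keys and values are made up for the illustration.

```python
def render_meta(meta):
    """Render key=value lines in sorted key order.

    Sorting makes the output independent of how the underlying
    container happens to iterate over its keys.
    """
    return "\n".join(f"{key}={meta[key]}" for key in sorted(meta))


# Two mappings with the same content, built in a different order
# (hypothetical sample data).
a = {"notify_active_uname": "node1 node2",
     "notify_active_resource": "rsc:0 rsc:1"}
b = {"notify_active_resource": "rsc:0 rsc:1",
     "notify_active_uname": "node1 node2"}

# Both render identically once the keys are sorted.
print(render_meta(a) == render_meta(b))
```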

As documented, resource agents can rely not on any particular ordering
within a single environment variable's value, but on the pairing of
values between related environment variables (e.g.
OCF_RESKEY_CRM_meta_notify_active_resource and
OCF_RESKEY_CRM_meta_notify_active_uname).
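
A minimal sketch of what relying on the pairing rather than on the
order could look like in an agent, written in Python purely for
illustration. It assumes the pairing is positional (the Nth resource
entry goes with the Nth node name) and that both values are
space-separated; the helper name `active_pairs` and the sample values
are hypothetical.

```python
import os

RESOURCE_VAR = "OCF_RESKEY_CRM_meta_notify_active_resource"
UNAME_VAR = "OCF_RESKEY_CRM_meta_notify_active_uname"


def active_pairs(env=None):
    """Return the active (resource, node) pairs as an unordered set."""
    env = os.environ if env is None else env
    resources = env.get(RESOURCE_VAR, "").split()
    unames = env.get(UNAME_VAR, "").split()
    if len(resources) != len(unames):
        raise ValueError("notification lists differ in length")
    # Pair entries by position, then discard the order on purpose.
    return set(zip(resources, unames))


# Hypothetical sample data: same pairs, listed in a different order.
env1 = {RESOURCE_VAR: "db:0 db:1", UNAME_VAR: "node2 node1"}
env2 = {RESOURCE_VAR: "db:1 db:0", UNAME_VAR: "node1 node2"}

# The agent sees the same information either way.
print(active_pairs(env1) == active_pairs(env2))
```

Working on the set of pairs means the agent behaves the same however
the lists happen to be ordered.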

> > While some if not all of these are now ordered, I'd regard treating
> > these variables as a "stable ordered list", as opposed to a "plain
> > unordered set", from within agents as still frowned upon unless
> > explicitly lifted.  For predictable backward/forward pacemaker+glib
> > version compatibility, if for no other reason.
> > 
> > Ken, do you agree?
> > 
> > (If so, we shall keep that in mind for future documentation tweaks
> > [possibly including also OCF updates], so that no false assumptions
> > are made in new agent implementations going forward.)
> 
> Correct, the lists given to resource agents via clone notification
> environment variables are not guaranteed to be in any particular
> order.
> 
> The documentation already does not claim any ordering, and in fact
> gives an example where node names are not in alphabetic order, so I
> think it's pretty obvious.
> 
> > 
> > > > More serious troubles stemming from this expectation-reality
> > > > mismatch regarding said data type cannot be ruled out at this
> > > > point and are subject to further investigation.  When in doubt,
> > > > staying with glib up to and including v2.58.2 (said tests are
> > > > passing with it, though any later v2.58.* may keep working "as
> > > > always") is likely a good idea for the time being.
> > 
> > I think this still partially holds and will only prove fully settled
> > with time?  I mean, for anything truly reproducible (as in
> > crm_simulate), pacemaker prior to 2.0.1 needs to be combined
> > uniformly with either glib pre-2.59.0 or glib 2.59.0-or-later
> > (reproducers need to follow the original) to get the same results.
> > With pacemaker 2.0.1+, identical results (but possibly differing
> > from either of the former combos) will _likely_ be obtained
> > regardless of the particular run-time linked glib version, but the
> > strength of this "likely" will only be established with future
> > experience, I suppose (it shall still universally hold within the
> > same glib class per the stated division, so no change in this
> > already positive regard).
> > 
> > Just scratched the surface, so I'd gladly be corrected.