[Pacemaker] Release model

Fri Jun 28 08:04:48 EDT 2013

On 28/06/2013, at 8:59 PM, Lars Marowsky-Bree <lmb at suse.com> wrote:

> On 2013-06-28T18:41:35, Andrew Beekhof <andrew at beekhof.net> wrote:
> 
>>> There's an exception: dropping commonly used external interfaces (say,
>>> "ptest") needs to be announced a few releases in advance before being enacted
>>> upstream. (And if Enterprise distributions want to keep something, they
>>> have time to prepare for that.) And of course, if major components get
>>> rewritten, they either need more testing or should be in place in
>>> parallel for 1 or 2 releases.
>> Now we start to diverge...
>> 
>> Keeping two lrmd's around? Two stonithd's?
> 
> Well, I can dream, can't I. ;-)

Of course.

> But perhaps you're right. The LRM
> rewrite taught us something about the perils of rewriting components
> that are badly documented and don't have good regression tests and where
> not all options they supported were written down somewhere.
> 
> But as an isolated component, would it have been so difficult to ship a
> separate implementation of the LRM first, perhaps as a compile time
> switch?

Honestly - more than likely it would have been.
Looking at the commit for just the glue between crmd and lrmd:

# git show --stat 6f8f559 crmd/lrm.c
commit 6f8f5594a940b015cf7fadb26c1da7e110d73103
Author: David Vossel <dvossel at redhat.com>
Date:   Wed May 30 17:42:37 2012 -0500

    High: crmd: Enable use of new lrmd daemon and client library in crmd.

 crmd/lrm.c | 687 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------------------------------------------------------------------------------------------------------------------------
 1 file changed, 274 insertions(+), 413 deletions(-)

there is certainly some easy stuff to mask:

* HA_OK -> lrmd_ok
* lrm_free_rsc() -> lrmd_free_rsc_info()

But there are also some fundamental changes to the crmd/lrmd interaction.
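
To illustrate only the "easy stuff to mask" part: a minimal sketch of a compile-time switch that hides the two renames listed above. Everything except the four names HA_OK, lrmd_ok, lrm_free_rsc() and lrmd_free_rsc_info() is hypothetical (the SKETCH_USE_OLD_LRM macro, the compat_* names, the stub struct, the numeric values and the function signatures), and the stubs exist only so that the sketch compiles on its own. It does nothing for the fundamental changes in the interaction.

```c
#include <stdlib.h>
#include <string.h>

/* Placeholder resource type. The real structures in cluster-glue and
 * pacemaker differ; this stub is an assumption for illustration only. */
typedef struct {
    char *id;
} rsc_stub_t;

#ifdef SKETCH_USE_OLD_LRM            /* hypothetical build-time switch */
enum { HA_OK = 1 };                  /* placeholder value */
static void lrm_free_rsc(rsc_stub_t *rsc)          /* stub, assumed signature */
{
    if (rsc != NULL) {
        free(rsc->id);
        free(rsc);
    }
}
#  define compat_ok        HA_OK
#  define compat_free_rsc  lrm_free_rsc
#else
enum { lrmd_ok = 0 };                /* placeholder value */
static void lrmd_free_rsc_info(rsc_stub_t *rsc)    /* stub, assumed signature */
{
    if (rsc != NULL) {
        free(rsc->id);
        free(rsc);
    }
}
#  define compat_ok        lrmd_ok
#  define compat_free_rsc  lrmd_free_rsc_info
#endif

/* Caller code is written once against the compat_* names and never
 * mentions either implementation directly. Returns compat_ok on
 * success and -1 on allocation failure. */
static int compat_touch_rsc(const char *id)
{
    rsc_stub_t *rsc = calloc(1, sizeof(*rsc));

    if (rsc == NULL) {
        return -1;
    }
    rsc->id = malloc(strlen(id) + 1);
    if (rsc->id == NULL) {
        compat_free_rsc(rsc);
        return -1;
    }
    strcpy(rsc->id, id);

    compat_free_rsc(rsc);
    return compat_ok;
}
```

A layer like this covers renamed constants and functions; it cannot cover the places where the calling pattern between the two daemons itself changed.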

> (Assuming the interface to the component doesn't change so much.
> It could hardly have been worse than supporting all those different
> messaging APIs and their versions.)
> 
> The latter is, perhaps, not a bad example.

Not a bad example, but the different messaging APIs have/had an expected future lifespan of more than "a release or two".

> 
>> Or two copies of the PE after I rewrite ordering constraints? Urgh :-(
> 
> The PE is different; almost all of its features are documented and
> protected by strong regression tests. That support for an option would
> be dropped by accident is almost unthinkable. Hence, the implementation
> can be considered almost entirely internal.
> 
> But people were using options that the new LRM no longer supported,
> called lrmadmin in some of their scripts, etc. So I think the
> differentiation between the PE and the LRM does exist.

Agreed.

> 
> Perhaps the lesson is "Write regression tests before a rewrite." (And
> I'm not saying it's a lesson that depended entirely on you or David. If
> cluster-glue's LRM had had such a suite, it'd certainly have helped
> tons.)

I think he did actually.
But given that some "features" were an undocumented distant memory, it was hard for him to know to write tests for them.

> The Linux kernel 3.x series seems to be coping quite nicely, too. They
> do have stable series to which they backport, though. That's always an
> option: if $someone feels the need to do longer support for, say,
> 1.1.10, they can always help start 1.1.10.x.
> 
>> If that sort of thing wasn't such a PITA you'd have done it with 1.1.8.
> 
> Yeah, and there were some here who advocated this. Given the scope of
> the other changes at the time, I thought it better to integrate it via a
> different path into SLE HA.
> 
>> Which is the problem with the Firefox model - either there is no "good" time to make them, or users hate us because we can make them at any time.
> 
> For Firefox, though, I've never noticed a problem (and I'm an ardent
> follower of the updates). The exceptions are, of course, add-ons: so I
> don't update until the add-ons I depend on are also updated.
> 
>> Even broadcasting changes can have limited value.
>> To use a recent example, crmsh was left in place for well over a year (iirc) before it was dropped.
>> That didn't seem to help anything...
> 
> Probably a communication problem.
> 
> And the way we "fixed" this on SLE HA was to pull in the new package
> via a dependency, so that users never noticed that we split the
> projects. Clearly, that's impossible to do when one chooses to drop a
> major component for good.
> 
>>> (Perpetuated by customers willing to pay for it, and because admittedly
>>> not all components have good test suites.)
>> Me too, but how do we do this where all the downside doesn't fall on me?
> 
> I'm not sure there's a huge downside in it for you?

OK, let's take attrd for example, which I've been wanting to rewrite to be truly atomic for half a decade or more.

Under this model, not only do I have to find the time to write and test the new addition, but I also have to:
* keep maintaining the old code until... when?
* probably write and maintain a compatibility layer
* make it possible to choose which gets used (a small but annoying task)
* make it possible to figure out which was in use for support
* educate people that there are two, and when to choose one over the other
* answer copious emails from confused users

That's starting to sound like a decent fraction of the overall effort.

[/me puts on selfish hat]
And none of the extra work is of any benefit to me or RH.
The code got written because it replaces something that needed replacing and I'd not have included it in any kind of release if I didn't have good reason to believe it was as good or better than what it was replacing.
[/me takes off hat]

But of course everyone else can only take my word for the "as good or better" part, which is not ideal.

Perhaps it's just a question of making the -rc phase longer when there's a big change involved.

> You'd get to develop
> and bring forward pacemaker 2.x all you want - and if RHEL7 wanted to
> freeze a specific version, they'd support 2.x.y for that. (OK, so that
> would probably be you too, though.)

In my perfect world, under this model, RH would dip into the releases and take every 2nd or 3rd, whatever was ready at the time.

Btw. _IF_ we do this, I'd be wanting to go with Pacemaker-$x (no .y or .z).
We shouldn't create the impression of doing release series when we're not.