[ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

Wed Jun 8 00:23:12 EDT 2016

On Wed, Jun 8, 2016 at 10:29 AM, Andrew Beekhof <abeekhof at redhat.com> wrote:
> On Wed, Jun 8, 2016 at 12:11 AM, Adam Spiers <aspiers at suse.com> wrote:
>> Ken Gaillot <kgaillot at redhat.com> wrote:
>>> On 06/06/2016 05:45 PM, Adam Spiers wrote:
>>> > Adam Spiers <aspiers at suse.com> wrote:
>>> >> Andrew Beekhof <abeekhof at redhat.com> wrote:
>>> >>> On Tue, Jun 7, 2016 at 8:29 AM, Adam Spiers <aspiers at suse.com> wrote:
>>> >>>> Ken Gaillot <kgaillot at redhat.com> wrote:
>>> >>>>> My main question is how useful would it actually be in the proposed use
>>> >>>>> cases. Considering the possibility that the expected start might never
>>> >>>>> happen (or fail), can an RA really do anything different if
>>> >>>>> start_expected=true?
>>> >>>>
>>> >>>> That's the wrong question :-)
>>> >>>>
>>> >>>>> If the use case is there, I have no problem with
>>> >>>>> adding it, but I want to make sure it's worthwhile.
>>> >>>>
>>> >>>> The use case which started this whole thread is for
>>> >>>> start_expected=false, not start_expected=true.
>>> >>>
>>> >>> Isn't this just two sides of the same coin?
>>> >>> If you're not doing the same thing for both cases, then you're just
>>> >>> reversing the order of the clauses.
>>> >>
>>> >> No, because the stated concern about unreliable expectations
>>> >> ("Considering the possibility that the expected start might never
>>> >> happen (or fail)") was regarding start_expected=true, and that's the
>>> >> side of the coin we don't care about, so it doesn't matter if it's
>>> >> unreliable.
>>> >
>>> > BTW, if the expected start happens but fails, then Pacemaker will just
>>> > keep repeating until migration-threshold is hit, at which point it
>>> > will call the RA 'stop' action finally with start_expected=false.
>>> > So that's of no concern.
>>>
>>> To clarify, that's configurable via start-failure-is-fatal and on-fail.
>>
>> Sure.
>>
>>> > Maybe your point was that if the expected start never happens (so
>>> > never even gets a chance to fail), we still want to do a nova
>>> > service-disable?
>>>
>>> That is a good question, which might mean it should be done on every
>>> stop -- or could that cause problems (besides delays)?
>>
>> No, the whole point of adding this feature is to avoid a
>> service-disable on every stop, and instead only do it on the final
>> stop.  If there are corner cases where we never reach the final stop,
>> that's not a disaster because nova will eventually figure it out and
>> do the right thing when the server-agent connection times out.
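
The stop-side behaviour described above can be sketched roughly as follows. This is only an illustration, not Pacemaker or resource agent code: `start_expected` is the parameter proposed in this thread (it does not exist yet), and `stop_service` and `disable_service` are hypothetical callables standing in for stopping the local daemon and running nova service-disable.

```python
from typing import Callable


def handle_stop(start_expected: bool,
                stop_service: Callable[[], None],
                disable_service: Callable[[], None]) -> bool:
    """Stop the service; disable it in nova only on the final stop.

    Returns True if disable_service was called.
    """
    # The service itself is always stopped, whatever is expected next.
    stop_service()
    if start_expected:
        # A start is expected to follow, so skip the service-disable.
        return False
    # Final stop (hypothetical start_expected=false): tell nova right away
    # instead of waiting for its connection timeout.
    disable_service()
    return True
```

If the final stop is never reached, nothing here runs service-disable, which matches the corner case above where nova is left to notice on its own.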
>>
>>> Another aspect of this is that the proposed feature could only look at a
>>> single transition. What if stop is called with start_expected=false, but
>>> then Pacemaker is able to start the service on the same node in the next
>>> transition immediately afterward? Would having called service-disable
>>> cause problems for that start?
>>
>> We would also need to ensure that service-enable is called on start
>> when necessary.  Perhaps we could track the enable/disable state in a
>> local temporary file, and if the file indicates that we've previously
>> done service-disable, we know to run service-enable on start.  This
>> would avoid calling service-enable on every single start.
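
A minimal sketch of the temporary-file idea, under stated assumptions: the marker path and the `enable_service` callable are hypothetical, and a real RA would more likely be a shell script with its own choice of state directory.

```python
import os
import tempfile

# Hypothetical marker location, chosen only for illustration.
MARKER = os.path.join(tempfile.gettempdir(), "nova-compute.disabled")


def record_disabled(marker: str = MARKER) -> None:
    """Note that service-disable has been run (called from 'stop')."""
    with open(marker, "w") as f:
        f.write("disabled\n")


def enable_if_needed(enable_service, marker: str = MARKER) -> bool:
    """On 'start', run service-enable only if the marker exists.

    Returns True if enable_service was called.
    """
    if not os.path.exists(marker):
        # No record of a previous service-disable: nothing to undo.
        return False
    enable_service()
    # Remove the marker only after the enable call has returned.
    os.remove(marker)
    return True
```

Note that the marker lives in a temporary directory in this sketch, so it may not survive a reboot.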
>
> feels like an over-optimization
> in fact, the whole thing feels like that if i'm honest.

Today the stars aligned :-)

   http://xkcd.com/1691/

>
> why are we trying to optimise the projected performance impact when
> the system is in terrible shape already?
>
>>
>>> > Yes that would be nice, but this proposal was never intended to
>>> > address that.  I guess we'd need an entirely different mechanism in
>>> > Pacemaker for that.  But let's not allow perfection to become the
>>> > enemy of the good ;-)
>>>
>>> The ultimate concern is that this will encourage people to write RAs
>>> that leave services in a dangerous state after stop is called.
>>
>> I don't see why it would.
>
> Previous experience suggests it definitely will.
>
> People will do exactly what you're thinking but with something important.
> They'll see it behaves as they expect in best-case testing and never
> think about the corner cases.
> Then they'll start thinking about optimising their start operations,
> write some "optimistic" state recording code and break those too.
>
> Imagine a bug in your state recording code (maybe you forget to handle
> a missing state file after reboot) that means the 'enable' doesn't get
> run.  The service is up, but nova will never use it.
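
To make the missing-state-file case concrete, here is a small sketch of a state check that fails safe: anything other than a readable file saying "enabled" is treated as needing an enable. The file name and its contents are assumptions made up for this example, not part of any existing RA.

```python
def needs_enable(state_file: str) -> bool:
    """Decide on 'start' whether service-enable must be run.

    Only a readable state file containing "enabled" skips the enable.
    """
    try:
        with open(state_file) as f:
            return f.read().strip() != "enabled"
    except OSError:
        # Missing or unreadable state (for example after a reboot):
        # assume the service may be disabled and enable it again.
        return True
```

The cost of this choice is an occasional redundant service-enable. The opposite default, treating a missing file as "nothing to do", gives the outcome described above.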
>
>> The new feature will be obscure enough that
>> no one would be able to use it without reading the corresponding
>> documentation first anyway.
>
> I like your optimism.
>
>>
>>> I think with naming and documenting it properly, I'm fine to provide the
>>> option, but I'm on the fence. Beekhof needs a little more convincing :-)
>>
>> Can you provide an example of a potential real-world situation where
>> an RA author would end up accidentally abusing the feature?
>
> You want a real-world example of how someone could accidentally
> misuse a feature that doesn't exist yet?
>
> Um... if we knew all the weird and wonderful ways people break our
> code we'd be able to build a better mouse trap.
>
>>
>> Thanks a lot for your continued attention on this!
>>
>> Adam