[ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

Wed Jun 8 00:29:50 UTC 2016

On Wed, Jun 8, 2016 at 12:11 AM, Adam Spiers <aspiers at suse.com> wrote:
> Ken Gaillot <kgaillot at redhat.com> wrote:
>> On 06/06/2016 05:45 PM, Adam Spiers wrote:
>> > Adam Spiers <aspiers at suse.com> wrote:
>> >> Andrew Beekhof <abeekhof at redhat.com> wrote:
>> >>> On Tue, Jun 7, 2016 at 8:29 AM, Adam Spiers <aspiers at suse.com> wrote:
>> >>>> Ken Gaillot <kgaillot at redhat.com> wrote:
>> >>>>> My main question is how useful would it actually be in the proposed use
>> >>>>> cases. Considering the possibility that the expected start might never
>> >>>>> happen (or fail), can an RA really do anything different if
>> >>>>> start_expected=true?
>> >>>>
>> >>>> That's the wrong question :-)
>> >>>>
>> >>>>> If the use case is there, I have no problem with
>> >>>>> adding it, but I want to make sure it's worthwhile.
>> >>>>
>> >>>> The use case which started this whole thread is for
>> >>>> start_expected=false, not start_expected=true.
>> >>>
>> >>> Isn't this just two sides of the same coin?
>> >>> If you're not doing the same thing for both cases, then you're just
>> >>> reversing the order of the clauses.
>> >>
>> >> No, because the stated concern about unreliable expectations
>> >> ("Considering the possibility that the expected start might never
>> >> happen (or fail)") was regarding start_expected=true, and that's the
>> >> side of the coin we don't care about, so it doesn't matter if it's
>> >> unreliable.
>> >
>> > BTW, if the expected start happens but fails, then Pacemaker will just
>> > keep repeating until migration-threshold is hit, at which point it
>> > will call the RA 'stop' action finally with start_expected=false.
>> > So that's of no concern.
>>
>> To clarify, that's configurable, via start-failure-is-fatal and on-fail.
>
> Sure.
>
>> > Maybe your point was that if the expected start never happens (so
>> > never even gets a chance to fail), we still want to do a nova
>> > service-disable?
>>
>> That is a good question, which might mean it should be done on every
>> stop -- or could that cause problems (besides delays)?
>
> No, the whole point of adding this feature is to avoid a
> service-disable on every stop, and instead only do it on the final
> stop.  If there are corner cases where we never reach the final stop,
> that's not a disaster because nova will eventually figure it out and
> do the right thing when the server-agent connection times out.
>
>> Another aspect of this is that the proposed feature could only look at a
>> single transition. What if stop is called with start_expected=false, but
>> then Pacemaker is able to start the service on the same node in the next
>> transition immediately afterward? Would having called service-disable
>> cause problems for that start?
>
> We would also need to ensure that service-enable is called on start
> when necessary.  Perhaps we could track the enable/disable state in a
> local temporary file, and if the file indicates that we've previously
> done service-disable, we know to run service-enable on start.  This
> would avoid calling service-enable on every single start.

That feels like an over-optimization.
In fact, the whole thing feels like that, if I'm honest.

Why are we trying to optimise the projected performance impact when
the system is in terrible shape already?
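
For readers following the thread, the stop-side behaviour under
discussion could be sketched roughly as below. This is an illustration
in Python only, not code from any existing RA. The environment variable
name OCF_RESKEY_CRM_meta_start_expected is hypothetical (the feature
does not exist, and the thread does not say how the hint would reach
the RA), and disable_service is a stand-in for whatever would run
nova service-disable. The sketch assumes that a missing hint means a
start is still expected.

```python
from pathlib import Path

# Hypothetical variable name: the proposed hint does not exist in Pacemaker.
START_EXPECTED_VAR = "OCF_RESKEY_CRM_meta_start_expected"


def is_final_stop(env):
    """Return True only when the hint is present and explicitly false."""
    value = env.get(START_EXPECTED_VAR, "").strip().lower()
    return value in ("false", "0", "no")


def stop(env, marker, disable_service):
    """Stop-side sketch: disable the service in nova only on the final stop.

    Stopping the actual daemon is omitted. Returns True if the
    disable step ran, False otherwise.
    """
    if not is_final_stop(env):
        # A start is expected (or the hint is absent): skip the disable.
        return False
    # Record the state BEFORE disabling, so that a crash between the two
    # steps leads to a harmless extra enable rather than a missed one.
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.write_text("disabled\n")
    disable_service()
    return True
```

Writing the marker before calling disable_service is a deliberate
choice in this sketch: the worst case is then a redundant enable on the
next start, not a service that stays disabled.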

>
>> > Yes that would be nice, but this proposal was never intended to
>> > address that.  I guess we'd need an entirely different mechanism in
>> > Pacemaker for that.  But let's not allow perfection to become the
>> > enemy of the good ;-)
>>
>> The ultimate concern is that this will encourage people to write RAs
>> that leave services in a dangerous state after stop is called.
>
> I don't see why it would.

Previous experience suggests it definitely will.

People will do exactly what you're thinking but with something important.
They'll see it behaves as they expect in best-case testing and never
think about the corner cases.
Then they'll start thinking about optimising their start operations,
write some "optimistic" state recording code and break those too.

Imagine a bug in your state recording code (maybe you forget to handle
a missing state file after reboot) that means the 'enable' doesn't get
run.  The service is up, but nova will never use it.
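
To make that failure mode concrete, here is a minimal Python sketch of
start-side state tracking that fails safe. It is an assumption-laden
illustration, not an existing RA: the state file location, its
"enabled" contents, and the enable_service callable (a stand-in for
nova service-enable) are all hypothetical. The idea is to record only
the known-good state, so that a missing or unreadable file after a
reboot triggers the enable instead of skipping it.

```python
from pathlib import Path


def needs_enable(state_file):
    """Return True unless the file positively records 'enabled'.

    A missing, unreadable, or garbled file (for example after a reboot
    clears a tmpfs) is treated as "unknown", which means enable again.
    """
    try:
        return state_file.read_text().strip() != "enabled"
    except (OSError, ValueError):
        return True


def start(state_file, enable_service):
    """Start-side sketch: run the enable step unless known to be enabled.

    Starting the actual daemon is omitted. Returns True if the
    enable step ran, False if it was skipped.
    """
    if not needs_enable(state_file):
        return False
    enable_service()
    # Only record success after the enable call has returned.
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text("enabled\n")
    return True
```

The buggy variant described above is the inverse: it runs the enable
only when a "disabled" marker is found, so a lost marker silently
leaves the service unused.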

> The new feature will be obscure enough that
> no one would be able to use it without reading the corresponding
> documentation first anyway.

I like your optimism.

>
>> I think with naming and documenting it properly, I'm fine to provide the
>> option, but I'm on the fence. Beekhof needs a little more convincing :-)
>
> Can you provide an example of a potential real-world situation where
> an RA author would end up accidentally abusing the feature?

You want a real-world example of how someone could accidentally
misuse a feature that doesn't exist yet?

Um... if we knew all the weird and wonderful ways people break our
code we'd be able to build a better mouse trap.

>
> Thanks a lot for your continued attention on this!
>
> Adam