[ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

Fri Dec 15 09:42:03 EST 2017

On 20/05/16 17:04 +0100, Adam Spiers wrote:
> Klaus Wenninger <kwenning at redhat.com> wrote:
>> On 05/20/2016 08:39 AM, Ulrich Windl wrote:
>>> I think RAs should not rely on "stop" being called multiple times
>>> for a resource to be stopped.
> 
> Well, this would be a major architectural change.  Currently if
> stop fails once, the node gets fenced - period.  So if we changed
> this, there would presumably be quite a bit of scope for making the
> new design address whatever concerns you have about relying on "stop"
> *sometimes* needing to be called multiple times.  For the sake of
> backwards compatibility with existing RAs, I think we'd have to ensure
> the current semantics still work.  But maybe there could be a new
> option where RAs are allowed to return OCF_RETRY_STOP to indicate that
> they want to escalate, or something.  However it's not clear how that
> would be distinguished from an old RA returning the same value as
> whatever we chose for OCF_RETRY_STOP.
> 
>> I see a couple of positive points in having something inside pacemaker
>> that helps the RAs escalating their stop strategy:
>> 
>> - this way you have the same logging for all RAs - done within the
>>   RA it would look different with each of them
>> - timeout-retry stuff is potentially prone to not being implemented
>>   properly - like this you have a proven
>>   implementation within pacemaker
>> - keeps logic within RA simpler and guides implementation in
>>   a certain direction that makes them look more similar to each
>>   other making it easier to understand an RA you haven't seen
>>   before
> 
> Yes, all good points which I agree with.
> 
>> Of course there are basically two approaches to achieve this:
>> 
>> - give some global or per resource view of pacemaker to the RA and leave
>>   it to the RA to act in a responsible manner (like telling the RA
>>   that there are x stop-retries to come)
>> - handle the escalation within pacemaker and already tell the RA
>>   what you expect it to do like requesting a graceful / hard /
>>   emergency or however you would call it stop
> 
> I'd probably prefer the former, to avoid hardcoding any assumptions
> about the different levels of escalation the RA might want to take.
> That would almost certainly vary per RA.
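
To make the former approach a bit more concrete, here is a minimal
sketch in Python (nothing pacemaker provides today). It assumes a
hypothetical interface in which the cluster manager tells the RA how
many stop retries are still to come, and the RA maps that number onto
its own escalation ladder. The names pick_stop_level and retries_left
are made up for illustration; the level names are borrowed from the
list quoted above.

```python
# Hypothetical escalation ladder, gentlest first. Each RA would define
# its own; nothing here is part of the OCF spec or of pacemaker.
DEFAULT_LEVELS = ("graceful", "hard", "emergency")


def pick_stop_level(retries_left, levels=DEFAULT_LEVELS):
    """Return the stop level to use, given how many retries remain.

    retries_left is an assumed input from the cluster manager: the
    number of further stop attempts that will follow this one. The
    fewer remain, the harsher the level; with none left, the RA uses
    its last resort.
    """
    if retries_left < 0:
        raise ValueError("retries_left must not be negative")
    # More retries than levels: stay at the gentlest level for now.
    index = max(0, len(levels) - 1 - retries_left)
    return levels[index]
```

The ladder lives in the RA, so an agent with only two sensible levels
(or five) just passes a different tuple, and the manager does not need
to know what the levels mean.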

I'd like to point out the direction the just-released systemd 236
takes to solve "what if an action needs more time to finish than
permitted":

> The sd_notify() protocol can now with EXTEND_TIMEOUT_USEC=microsecond
> extend the effective start, runtime, and stop time. The service must
> continue to send EXTEND_TIMEOUT_USEC within the period specified to
> prevent the service manager from making the service as timedout.
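
For illustration, a minimal sketch in Python of what sending such a
message could look like on the service side. It assumes the usual
sd_notify() transport, a datagram sent to the AF_UNIX socket named in
the NOTIFY_SOCKET environment variable; the function name
extend_timeout and its notify_socket parameter are made up for this
example, and real code would more likely call sd_notify() from
libsystemd directly.

```python
import os
import socket


def extend_timeout(usec, notify_socket=None):
    """Ask the service manager for `usec` more microseconds.

    Returns True if a message was sent, False if no notification
    socket is configured (e.g. not running under systemd).
    """
    path = notify_socket or os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # Leading "@" denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    message = "EXTEND_TIMEOUT_USEC={}".format(int(usec)).encode("ascii")
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, path)
    return True
```

A long-running stop would call this repeatedly, e.g.
extend_timeout(5_000_000) every few seconds while it still makes
progress, and simply stop calling it once it gives up.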

EXTEND_TIMEOUT_USEC apparently does not solve "cannot wait forever,
as that degrades availability" right off the bat, and it is not well
suited to the current agent-driven, synchronous and sequenced
supervision model (which, from the beginning, was not planned to
remain the final state of the art[1], though). Still, it looks simple
enough and is quite close to the OCF_RETRY_STOP idea proposed above.

[1] https://github.com/ClusterLabs/OCF-spec/commit/2331bb8d3624a2697afaf3429cec1f47d19251f5#diff-316ade5241704833815c8fa2c2b71d4dR422

> However, we're slightly off-topic for this thread at this point ;-)

(It's all one big Gordian knot: everything is related, and it does
not help that we are not starting from a clean drawing board but are
already rolling some stones ahead of us.)

-- 
Poki