[ClusterLabs] Antw: Re: Antw: Re: Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

Fri May 20 09:12:28 UTC 2016

>>> Jehan-Guillaume de Rorthais <jgdr at dalibo.com> wrote on 20.05.2016 at 09:59
in message <20160520095934.029c1822 at firost>:
> On Fri, 20 May 2016 08:39:42 +0200,
> "Ulrich Windl" <Ulrich.Windl at rz.uni-regensburg.de> wrote:
> 
>> >>> Jehan-Guillaume de Rorthais <jgdr at dalibo.com> wrote on 19.05.2016 at 21:29
>> in message <20160519212947.6cc0fd7b at firost>:
>> [...]
>> > I was thinking of a use case where a graceful demote or stop action failed
>> > multiple times, to give the RA a chance to choose another method to stop
>> > the resource before it requires a migration. For instance, PostgreSQL has
>> > three different kinds of stop, the last one being not graceful, but still
>> > better than a kill -9.
>> 
>> For example, the Xen RA tries a clean shutdown with a timeout of about 2/3 of
>> the action timeout; if it fails, it shuts the VM down the hard way.
> 
> Reading the Xen RA, I see they added a shutdown timeout escalation parameter.

Not quite: the parameter is optional. Without it, the RA falls back to 2/3 of
the action timeout:
    if [ -n "$OCF_RESKEY_shutdown_timeout" ]; then
      timeout=$OCF_RESKEY_shutdown_timeout
    elif [ -n "$OCF_RESKEY_CRM_meta_timeout" ]; then
      # Allow 2/3 of the action timeout for the orderly shutdown
      # (The origin unit is ms, hence the conversion)
      timeout=$((OCF_RESKEY_CRM_meta_timeout/1500))
    else
      timeout=60
    fi
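
To make the arithmetic explicit, here is a minimal sketch that wraps the same
logic in a function. The function name orderly_shutdown_timeout is made up for
illustration and is not part of the Xen RA. Dividing milliseconds by 1000 gives
seconds, and taking 2/3 of that is the same as dividing by 1500 in one step.

```shell
#!/bin/sh
# Hypothetical helper (not in the Xen RA): print the time budget, in seconds,
# for the orderly shutdown attempt.
orderly_shutdown_timeout() {
    if [ -n "$OCF_RESKEY_shutdown_timeout" ]; then
        # An explicit resource parameter wins.
        echo "$OCF_RESKEY_shutdown_timeout"
    elif [ -n "$OCF_RESKEY_CRM_meta_timeout" ]; then
        # Action timeout is in ms: (ms / 1000) * 2 / 3 == ms / 1500
        echo $((OCF_RESKEY_CRM_meta_timeout / 1500))
    else
        # Neither value is set: fall back to 60 seconds.
        echo 60
    fi
}
```

With an action timeout of 90000 ms and no shutdown_timeout parameter, this
yields 60 seconds for the orderly shutdown, leaving 30 seconds for the hard way.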

> This is a reasonable solution, but isn't it possible to get the action timeout
> directly? I looked for such information in the past with no success.

See above: the RA receives the action timeout, in milliseconds, as
$OCF_RESKEY_CRM_meta_timeout.

> 
>> 
>> I don't know Postgres in detail, but I could imagine a three-step approach:
>> 1) Shutdown after current operations have finished
>> 2) Shutdown regardless of pending operations (doing rollbacks)
>> 3) Shutdown the hard way, requiring recovery on the next start (I think in
>> Oracle this is called a "shutdown abort")
> 
> Exactly.
> 
>> Depending on the scenario one may start at step 2)
> 
> Indeed.
>  
>> [...]
>> I think RAs should not rely on "stop" being called multiple times for a
>> resource to be stopped.
> 
> Ok, so the RA should take care of its own escalation during a single action.
> 
> Thanks, 
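
A rough sketch of what such an escalation inside a single stop action could
look like, under stated assumptions: try_stop is a hypothetical hook that the
RA would have to provide, and the even split of the time budget is an
arbitrary choice, not taken from any existing RA. The mode names follow
PostgreSQL's three shutdown modes (smart, fast, immediate), which correspond
to steps 1) to 3) above.

```shell
#!/bin/sh
# Sketch only. try_stop is a hypothetical hook: it attempts one stop mode
# ($1) within $2 seconds and returns 0 once the resource is down.
# For PostgreSQL it might wrap something like:
#   pg_ctl stop -D "$PGDATA" -w -m "$1" -t "$2"
escalating_stop() {
    # Action timeout in ms; 60000 is an assumed default for this sketch.
    total_ms=${OCF_RESKEY_CRM_meta_timeout:-60000}
    # As in the Xen RA, spend 2/3 of the action timeout on graceful attempts.
    graceful_s=$((total_ms / 1500))
    # Assumption: split that budget evenly between the two graceful modes.
    per_mode_s=$((graceful_s / 2))
    for mode in smart fast; do
        if try_stop "$mode" "$per_mode_s"; then
            return 0
        fi
    done
    # Last resort with the remaining third: abort-style stop, which
    # requires recovery on the next start.
    try_stop immediate $((total_ms / 1000 - graceful_s))
}
```

This way a single "stop" call walks through all levels on its own, and the
cluster only sees a failure if even the hardest method did not work in time.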

Regards,
Ulrich