[ClusterLabs] Antw: Re: Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

Fri May 20 07:02:58 EDT 2016

On Fri, 20 May 2016 11:12:28 +0200,
"Ulrich Windl" <Ulrich.Windl at rz.uni-regensburg.de> wrote:

> >>> Jehan-Guillaume de Rorthais <jgdr at dalibo.com> wrote on 20.05.2016 at
> 09:59 in message <20160520095934.029c1822 at firost>:
> > On Fri, 20 May 2016 08:39:42 +0200,
> > "Ulrich Windl" <Ulrich.Windl at rz.uni-regensburg.de> wrote:
> > 
> >> >>> Jehan-Guillaume de Rorthais <jgdr at dalibo.com> wrote on 19.05.2016 at
> >> >>> 21:29 in message <20160519212947.6cc0fd7b at firost>:
> >> [...]
> >> > I was thinking of a use case where a graceful demote or stop action
> >> > failed multiple times, to give the RA a chance to choose another method
> >> > to stop the resource before it requires a migration. For instance,
> >> > PostgreSQL has 3 different kinds of stop, the last one being not
> >> > graceful, but still better than a kill -9.
> >> 
> >> For example, the Xen RA tries a clean shutdown with a timeout of about 2/3
> >> of the action timeout; if it fails, it shuts the VM down the hard way.
> > 
> > Reading the Xen RA, I see they added a shutdown timeout escalation
> > parameter.
> 
> Not quite:
>     if [ -n "$OCF_RESKEY_shutdown_timeout" ]; then
>       timeout=$OCF_RESKEY_shutdown_timeout
>     elif [ -n "$OCF_RESKEY_CRM_meta_timeout" ]; then
>       # Allow 2/3 of the action timeout for the orderly shutdown
>       # (The origin unit is ms, hence the conversion)
>       timeout=$((OCF_RESKEY_CRM_meta_timeout/1500))
>     else
>       timeout=60
>     fi
> 
> > This is a reasonable solution, but isn't it possible to get the action
> > timeout directly? I looked for such information in the past with no
> > success.
> 
> See above.

Gosh, this is embarrassing...how could we miss that?

Thank you for pointing this out!
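
For the archive, the arithmetic in the quoted Xen RA snippet can be sketched as
a standalone helper. This is only an illustration under assumptions: the
function name graceful_timeout is hypothetical and not part of any RA, and it
takes the two values as arguments instead of reading the environment directly.
The point is that the action timeout arrives in milliseconds, so dividing by
1500 yields 2/3 of it in whole seconds.

```shell
#!/bin/sh
# Sketch only: compute how many seconds to allow for a graceful stop.
# graceful_timeout is a hypothetical helper name used for illustration.
#   $1: explicit override in seconds (may be empty)
#   $2: action timeout in milliseconds (may be empty)
graceful_timeout() {
    if [ -n "$1" ]; then
        # An explicit setting wins over anything derived.
        echo "$1"
    elif [ -n "$2" ]; then
        # ms -> s is /1000; keeping 2/3 of that is the same as /1500.
        # Shell arithmetic is integer-only, so the result is rounded down.
        echo $(( $2 / 1500 ))
    else
        # Neither value is known: fall back to a fixed default.
        echo 60
    fi
}
```

In an RA it would presumably be called as
graceful_timeout "$OCF_RESKEY_shutdown_timeout" "$OCF_RESKEY_CRM_meta_timeout",
leaving the remaining third of the action timeout for a harder stop method.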