[ClusterLabs] Antw: Re: [Question] About a change of crm_failcount.

Thu Feb 9 06:46:13 EST 2017

On Thu, 9 Feb 2017 19:24:22 +0900 (JST)
renayama19661014 at ybb.ne.jp wrote:

> Hi Ken,
> 
> 
> > 1. Return a "hard" error such as OCF_ERR_ARGS or OCF_ERR_PERM. When
> > Pacemaker gets one of these errors from an agent, it will ban the
> > resource from that node (until the failure is cleared).  
> 
> The first suggestion does not work well.
> 
> Even if the RA returns OCF_ERR_ARGS or OCF_ERR_PERM, it does so from its
> pre-promote notify handling, and Pacemaker does not record a notify
> (pre-promote) error in the CIB.
> 
>  * https://github.com/ClusterLabs/pacemaker/blob/master/crmd/lrm.c#L2411
> 
> Because the error is not recorded in the CIB, pengine cannot treat it as a
> "hard" error.

Indeed. That's why PAF uses a private attribute to pass information between
actions. We detect the failure during the notify as well, but raise the error
during the promotion itself. See how I dealt with this in PAF:

https://github.com/ioguix/PAF/commit/6123025ff7cd9929b56c9af2faaefdf392886e68

As private attributes do not work on older stacks, you could rely on a local
temp file in $HA_RSCTMP instead.
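
For the temp file variant, something along these lines might do. This is only
a minimal sketch and not the PAF code: the function names and the flag file
name are hypothetical, and it assumes the node that notices the problem during
the notify is the one that later runs the promote.

```shell
#!/bin/sh
# Sketch only: function names and the flag-file name are hypothetical.
# HA_RSCTMP is normally provided by the OCF shell functions; fall back to a
# temporary directory so the sketch can run standalone.
: "${HA_RSCTMP:=${TMPDIR:-/tmp}}"
: "${OCF_RESOURCE_INSTANCE:=pgsql-demo}"
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

flag_file() {
    echo "${HA_RSCTMP}/${OCF_RESOURCE_INSTANCE}.promote_refused"
}

# Called from the pre-promote notify. The notify result is not recorded in
# the CIB, so only remember the reason here and report success.
refuse_next_promote() {
    echo "$1" > "$(flag_file)"
    return $OCF_SUCCESS
}

# Called from the promote action: this is where the error is really raised.
do_promote() {
    if [ -f "$(flag_file)" ]; then
        cat "$(flag_file)" >&2      # show the recorded reason
        rm -f "$(flag_file)"        # do not fail the next promote as well
        return $OCF_ERR_GENERIC
    fi
    return $OCF_SUCCESS
}
```

The flag is removed as soon as it has been consumed, so a later promote is
judged again by the next notify.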

> > 2. Use crm_resource --ban instead. This would ban the resource from that
> > node until the user removes the ban with crm_resource --clear (or by
> > deleting the ban constraint from the configuration).  
> 
> The second suggestion works well.
> I intend to adopt the second suggestion.
> 
> As another method, crm_resource -F also seems to be available; what do you
> think? I think last-failure would not be a problem either, since
> crm_resource -F makes the cluster handle a pseudo failure.
> 
> crm_resource -F may be usable, but I will adopt crm_resource -B because the
> RA wants to stop the pgsql resource completely.
> 
> ``` @pgsql RA
> 
> pgsql_pre_promote() {
> (snip)
>             if [ "$cmp_location" != "$my_master_baseline" ]; then
>                 ocf_exit_reason "My data is newer than new master's one. New master's location : $master_baseline"
>                 exec_with_retry 0 $CRM_RESOURCE -B -r $OCF_RESOURCE_INSTANCE -N $NODENAME -Q
>                 return $OCF_ERR_GENERIC
>             fi
> (snip)
>     CRM_FAILCOUNT="${HA_SBIN_DIR}/crm_failcount"
>     CRM_RESOURCE="${HA_SBIN_DIR}/crm_resource"
> ```
> 
> I will test the behavior a little more and then send a patch.

I suppose crm_resource -F will just raise the failcount, break the current
transition and the CRM will recompute another transition paying attention to
your "failed" resource (will it try to recover it? retry the previous
transition again?).

I would bet on crm_resource -B.
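
For reference, the long-option form of the ban and its later removal could be
wrapped like this. It is just a sketch: the function names are hypothetical,
and CRM_RESOURCE falls back to plain crm_resource from the PATH, which is an
assumption (the RA above uses ${HA_SBIN_DIR}/crm_resource).

```shell
#!/bin/sh
# Sketch only: the function names are hypothetical. CRM_RESOURCE can be
# overridden (for example with "echo") to see the command without running it.
: "${CRM_RESOURCE:=crm_resource}"

# Ban the resource from one node. This adds a location constraint that
# stays in the configuration until it is cleared.
ban_from_node() {
    $CRM_RESOURCE --ban --resource "$1" --node "$2" --quiet
}

# Remove the ban again once the node has been repaired.
clear_ban() {
    $CRM_RESOURCE --clear --resource "$1" --node "$2"
}
```

With CRM_RESOURCE=echo, "ban_from_node pgsql node1" only prints the arguments
that would be passed, which is handy when checking the RA by hand.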

> ----- Original Message -----
> > From: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
> > To: users at clusterlabs.org; kgaillot at redhat.com
> > Cc: 
> > Date: 2017/2/6, Mon 17:44
> > Subject: [ClusterLabs] Antw: Re: [Question] About a change of crm_failcount.
> >   
> >>>>  Ken Gaillot <kgaillot at redhat.com> wrote on 02.02.2017 at 19:33 in message
> > <91a83571-9930-94fd-e635-96283067105c at redhat.com>:  
> >>  On 02/02/2017 12:23 PM, renayama19661014 at ybb.ne.jp wrote:  
> >>>  Hi All,
> >>> 
> >>>  Since the following fix, the user can no longer set any value other
> >>>  than zero with crm_failcount.
> >>> 
> >>>   - [Fix: tools: implement crm_failcount command-line options correctly]
> >>>     - https://github.com/ClusterLabs/pacemaker/commit/95db10602e8f646eefed335414e40a994498cafd#diff-6e58482648938fd488a920b9902daac4
> >>> 
> >>>  However, pgsql RA sets INFINITY in a script.
> >>> 
> >>>  ```
> >>>  (snip)
> >>>      CRM_FAILCOUNT="${HA_SBIN_DIR}/crm_failcount"
> >>>  (snip)
> >>>      ocf_exit_reason "My data is newer than new master's one. New master's location : $master_baseline"
> >>>      exec_with_retry 0 $CRM_FAILCOUNT -r $OCF_RESOURCE_INSTANCE -U $NODENAME -v INFINITY
> >>>      return $OCF_ERR_GENERIC
> >>>  (snip)
> >>>  ```
> >>> 
> >>>  Only the pgsql RA seems to be affected.
> >>> 
> >>>  Could you change it so that crm_failcount can again set a value other
> >>>  than zero? If that is not possible, we will modify the pgsql RA to use
> >>>  crm_attribute instead.
> >>> 
> >>>  Best Regards,
> >>>  Hideo Yamauchi.  
> >> 
> >>  Hmm, I didn't realize that was used. I changed it because it's not a
> >>  good idea to set fail-count without also changing last-failure and
> >>  having a failed op in the LRM history. I'll have to think about what the
> >>  best alternative is.  
> > 
> > The question is also whether the RA can achieve the same effect some other way. I 
> > thought CRM sets the failcount, not the RA...
> >   

-- 
Jehan-Guillaume de Rorthais
Dalibo