[ClusterLabs] Antw: Re: [Question] About a change of crm_failcount.

Thu Feb 9 10:47:40 EST 2017

On 02/09/2017 05:46 AM, Jehan-Guillaume de Rorthais wrote:
> On Thu, 9 Feb 2017 19:24:22 +0900 (JST)
> renayama19661014 at ybb.ne.jp wrote:
> 
>> Hi Ken,
>>
>>
>>> 1. Return a "hard" error such as OCF_ERR_ARGS or OCF_ERR_PERM. When
>>> Pacemaker gets one of these errors from an agent, it will ban the
>>> resource from that node (until the failure is cleared).  
>>
>> The first suggestion does not work well.
>>
>> Even if the RA returns OCF_ERR_ARGS or OCF_ERR_PERM, the check runs in the
>> RA's pre-promote notify handling, and Pacemaker does not record notify
>> (pre-promote) errors in the CIB.
>>
>>  * https://github.com/ClusterLabs/pacemaker/blob/master/crmd/lrm.c#L2411
>>
>> Because the error is not recorded in the CIB, pengine cannot treat it as a
>> "hard" error.

Ah, I didn't think of that.

> Indeed. That's why PAF uses private attributes to pass information between
> actions. We detect the failure during the notify as well, but raise the error
> during the promotion itself. See how I dealt with this in PAF:
> 
> https://github.com/ioguix/PAF/commit/6123025ff7cd9929b56c9af2faaefdf392886e68

That's a nice use of private attributes.
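
A minimal sketch of that idea, for readers who have not looked at the PAF commit. It is not PAF's code: the attribute name and the three helper functions are hypothetical, and the query output format (`name="..." host="..." value="..."`) is an assumption about attrd_updater. The notify handler stores a reason in a private attribute, and promote checks for it and reports the failure itself, so the error is attached to an action that Pacemaker records.

```shell
#!/bin/sh
# Hypothetical names: ATTR_NAME and the three helpers below.
# attrd_updater's --private (-p) option needs corosync 2 and pacemaker 1.1.13+.
: "${ATTRD_UPDATER:=attrd_updater}"
ATTR_NAME="promote-blocked-${OCF_RESOURCE_INSTANCE:-pgsql}"

# Call from the pre-promote notify handler when the check fails.
remember_promote_blocker() {
    $ATTRD_UPDATER -n "$ATTR_NAME" -U "$1" -p
}

# Call from promote; returns 0 if notify stored a non-empty reason.
promote_is_blocked() {
    # Assumed query output: name="..." host="..." value="..."
    $ATTRD_UPDATER -n "$ATTR_NAME" -Q 2>/dev/null | grep -q 'value="[^"]'
}

# Call once the failure has been reported, so the flag does not linger.
forget_promote_blocker() {
    $ATTRD_UPDATER -n "$ATTR_NAME" -D
}
```

In promote, the agent would call promote_is_blocked and, if it succeeds, set an exit reason and return its error code before doing any real work.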

> As private attributes do not work on older stacks, you could rely on a local
> temp file in $HA_RSCTMP as well.
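
The temp-file fallback could look roughly like the sketch below. The marker file name and both helper functions are hypothetical, and the fallback value for HA_RSCTMP exists only so the sketch can be tried outside a cluster (in a real agent the variable comes from ocf-shellfuncs).

```shell
#!/bin/sh
# Hypothetical helper names; the /tmp fallback is only for standalone testing.
: "${HA_RSCTMP:=${TMPDIR:-/tmp}}"
MARKER="${HA_RSCTMP}/${OCF_RESOURCE_INSTANCE:-pgsql}.promote-blocked"

# In the pre-promote notify handler: record why promotion must fail.
mark_promote_blocked() {
    printf '%s\n' "$1" > "$MARKER"
}

# In promote: if a marker exists, print the stored reason, remove the
# marker, and return 0. Return 1 when nothing blocks the promotion.
take_promote_block() {
    [ -f "$MARKER" ] || return 1
    cat "$MARKER"
    rm -f "$MARKER"
    return 0
}

# Intended use inside promote (needs ocf-shellfuncs, so left as a comment):
#   if reason=$(take_promote_block); then
#       ocf_exit_reason "$reason"
#       return $OCF_ERR_PERM   # or another hard error code
#   fi
```

Removing the marker as soon as it is read keeps a stale file from blocking a later, legitimate promotion.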
> 
>>> 2. Use crm_resource --ban instead. This would ban the resource from that
>>> node until the user removes the ban with crm_resource --clear (or by
>>> deleting the ban constraint from the configuration).
>>
>> The second suggestion works well.
>> I intend to adopt the second suggestion.
>>
>> As another method, crm_resource -F also seems to be available. What do you
>> think? With crm_resource -F, I think last-failure would also be handled
>> without problems when simulating a failure.
>>
>> crm_resource -F may work, but I will adopt crm_resource -B because the RA
>> wants to stop the pgsql resource completely.
>>
>> ``` @pgsql RA
>>
>> pgsql_pre_promote() {
>> (snip)
>>             if [ "$cmp_location" != "$my_master_baseline" ]; then
>>                 ocf_exit_reason "My data is newer than new master's one. New master's location : $master_baseline"
>>                 exec_with_retry 0 $CRM_RESOURCE -B -r $OCF_RESOURCE_INSTANCE -N $NODENAME -Q
>>                 return $OCF_ERR_GENERIC
>>             fi
>> (snip)
>>     CRM_FAILCOUNT="${HA_SBIN_DIR}/crm_failcount"
>>     CRM_RESOURCE="${HA_SBIN_DIR}/crm_resource"
>> ```
>>
>> I will test the behavior a little more and then send a patch.
> 
> I suppose crm_resource -F will just raise the failcount, break the current
> transition and the CRM will recompute another transition paying attention to
> your "failed" resource (will it try to recover it? retry the previous
> transition again?).
> 
> I would bet on crm_resource -B.

Correct, crm_resource -F only simulates OCF_ERR_GENERIC, which is a soft
error. It might be a nice extension to be able to specify the error
code, but in this case, I think crm_resource -B (or the private
attribute approach, if you're OK with limiting it to corosync 2 and
pacemaker 1.1.13+) is better.
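
For reference, the ban and its later removal can be wrapped as in the sketch below. The function names are hypothetical, and CRM_RESOURCE is an overridable variable added here only so the commands can be tried as a dry run; the options are the ones already used in this thread (-B/--ban, -r, -N, -Q, --clear).

```shell
#!/bin/sh
# Hypothetical wrapper names; set CRM_RESOURCE=echo for a dry run.
: "${CRM_RESOURCE:=crm_resource}"

# Ban a resource from one node. The ban constraint stays in the
# configuration until it is cleared or deleted.
ban_resource() {
    $CRM_RESOURCE -B -r "$1" -N "$2" -Q
}

# Remove the ban again, for example after the node's data was repaired.
clear_ban() {
    $CRM_RESOURCE --clear -r "$1" -N "$2"
}
```

Unlike a fail-count, the ban does not expire or reset on its own, so an administrator has to run the clear step explicitly.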

>> ----- Original Message -----
>>> From: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
>>> To: users at clusterlabs.org; kgaillot at redhat.com
>>> Cc: 
>>> Date: 2017/2/6, Mon 17:44
>>> Subject: [ClusterLabs] Antw: Re: [Question] About a change of crm_failcount.
>>>   
>>>>>>  Ken Gaillot <kgaillot at redhat.com> wrote on 02.02.2017 at 19:33 in message <91a83571-9930-94fd-e635-96283067105c at redhat.com>:
>>>>  On 02/02/2017 12:23 PM, renayama19661014 at ybb.ne.jp wrote:  
>>>>>  Hi All,
>>>>>
>>>>>  Because of the following fix, users can no longer set a value other than
>>>>>  zero with crm_failcount.
>>>>>
>>>>>   - [Fix: tools: implement crm_failcount command-line options correctly]
>>>>>     - https://github.com/ClusterLabs/pacemaker/commit/95db10602e8f646eefed335414e40a994498cafd#diff-6e58482648938fd488a920b9902daac4
>>>>>
>>>>>  However, pgsql RA sets INFINITY in a script.
>>>>>
>>>>>  ```
>>>>>  (snip)
>>>>>      CRM_FAILCOUNT="${HA_SBIN_DIR}/crm_failcount"
>>>>>  (snip)
>>>>>      ocf_exit_reason "My data is newer than new master's one. New master's location : $master_baseline"
>>>>>      exec_with_retry 0 $CRM_FAILCOUNT -r $OCF_RESOURCE_INSTANCE -U $NODENAME -v INFINITY
>>>>>      return $OCF_ERR_GENERIC
>>>>>  (snip)
>>>>>  ```
>>>>>
>>>>>  This seems to affect only the pgsql RA.
>>>>>
>>>>>  Can you revise crm_failcount so that it can set a value other than zero?
>>>>>  If that is not possible, we will modify the pgsql RA to use crm_attribute
>>>>>  instead.
>>>>>
>>>>>  Best Regards,
>>>>>  Hideo Yamauchi.  
>>>>
>>>>  Hmm, I didn't realize that was used. I changed it because it's not a
>>>>  good idea to set fail-count without also changing last-failure and
>>>>  having a failed op in the LRM history. I'll have to think about what the
>>>>  best alternative is.
>>>
>>> The question also is whether the RA can achieve the same effect otherwise. I
>>> thought CRM sets the failcount, not the RA...