[ClusterLabs] Antw: Re: [Question] About a change of crm_failcount.
Ken Gaillot
kgaillot at redhat.com
Thu Feb 9 10:47:40 EST 2017
On 02/09/2017 05:46 AM, Jehan-Guillaume de Rorthais wrote:
> On Thu, 9 Feb 2017 19:24:22 +0900 (JST)
> renayama19661014 at ybb.ne.jp wrote:
>
>> Hi Ken,
>>
>>
>>> 1. Return a "hard" error such as OCF_ERR_ARGS or OCF_ERR_PERM. When
>>> Pacemaker gets one of these errors from an agent, it will ban the
>>> resource from that node (until the failure is cleared).
>>
>> The first suggestion does not work well.
>>
>> Even if the RA returns OCF_ERR_ARGS or OCF_ERR_PERM, this happens in the
>> RA's pre_promote (notify) handling, and Pacemaker does not record a notify
>> (pre-promote) error in the CIB.
>>
>> * https://github.com/ClusterLabs/pacemaker/blob/master/crmd/lrm.c#L2411
>>
>> Because it is not recorded in the CIB, the pengine cannot treat it as a
>> "hard" error.
Ah, I didn't think of that.
> Indeed. That's why PAF uses private attributes to pass information between
> actions. We detect the failure during the notify as well, but raise the error
> during the promotion itself. See how I dealt with this in PAF:
>
> https://github.com/ioguix/PAF/commit/6123025ff7cd9929b56c9af2faaefdf392886e68
That's a nice use of private attributes.
> As private attributes do not work on older stacks, you could rely on a local
> temp file in $HA_RSCTMP as well.
>
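The "detect during notify, raise during promote" idea with a local temp file could be sketched roughly as follows. This is only an illustration, not PAF's or the pgsql RA's actual code: the function names, the flag-file name, and the fallback values for HA_RSCTMP and OCF_RESOURCE_INSTANCE are assumptions, and the return code used in promote() is a placeholder.

```shell
#!/bin/sh
# Sketch only: function names and the flag-file name are hypothetical.
# HA_RSCTMP and OCF_RESOURCE_INSTANCE are normally provided by the OCF
# environment; the fallbacks below are just for running this standalone.
: "${HA_RSCTMP:=${TMPDIR:-/tmp}}"
: "${OCF_RESOURCE_INSTANCE:=pgsql}"
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

PROMOTE_BLOCK_FILE="${HA_RSCTMP}/${OCF_RESOURCE_INSTANCE}.promote_blocked"

# Pre-promote notify handler: the notify result is not recorded in the
# CIB, so only remember the problem locally and report success.
notify_pre_promote() {
    problem_detected="$1"    # "yes" when the check during notify failed
    if [ "$problem_detected" = "yes" ]; then
        touch "$PROMOTE_BLOCK_FILE"
    else
        rm -f "$PROMOTE_BLOCK_FILE"
    fi
    return $OCF_SUCCESS
}

# Promote action: raise the remembered failure here, where the result
# is recorded. The return code is a placeholder for whatever the RA needs.
promote() {
    if [ -e "$PROMOTE_BLOCK_FILE" ]; then
        rm -f "$PROMOTE_BLOCK_FILE"
        return $OCF_ERR_GENERIC
    fi
    return $OCF_SUCCESS
}
```

The flag file is removed once it has been acted on, so a later promote attempt is not blocked by stale state.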
>>> 2. Use crm_resource --ban instead. This would ban the resource from that
>>> node until the user removes the ban with crm_resource --clear (or by
>>> deleting the ban constraint from the configuration).
>>
>> The second suggestion works well.
>> I intend to adopt the second suggestion.
>>
>> As another method, crm_resource -F also seems to be usable; what do you
>> think? I think that with crm_resource -F, last-failure is also handled
>> without problems when simulating a failure.
>>
>> crm_resource -F may be usable, but I will adopt crm_resource -B because
>> the RA wants to stop the pgsql resource completely.
>>
>> ``` @pgsql RA
>>
>> pgsql_pre_promote() {
>> (snip)
>> if [ "$cmp_location" != "$my_master_baseline" ]; then
>>     ocf_exit_reason "My data is newer than new master's one. New master's location : $master_baseline"
>>     exec_with_retry 0 $CRM_RESOURCE -B -r $OCF_RESOURCE_INSTANCE -N $NODENAME -Q
>>     return $OCF_ERR_GENERIC
>> fi
>> (snip)
>> CRM_FAILCOUNT="${HA_SBIN_DIR}/crm_failcount"
>> CRM_RESOURCE="${HA_SBIN_DIR}/crm_resource"
>> ```
>>
>> I will test the behavior a little more and then send a patch.
>
> I suppose crm_resource -F will just raise the failcount, break the current
> transition and the CRM will recompute another transition paying attention to
> your "failed" resource (will it try to recover it? retry the previous
> transition again?).
>
> I would bet on crm_resource -B.
Correct, crm_resource -F only simulates OCF_ERR_GENERIC, which is a soft
error. It might be a nice extension to be able to specify the error
code, but in this case, I think crm_resource -B (or the private
attribute approach, if you're OK with limiting it to corosync 2 and
pacemaker 1.1.13+) is better.
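
For illustration, the ban and its later removal could be wrapped roughly like this. It is a sketch under assumptions: the helper names and the DRY_RUN switch are hypothetical, the ban options are the ones used in the patch quoted above (-B -r -N -Q), and the resource and node names in the usage note are placeholders.

```shell
#!/bin/sh
# Sketch only: run_cmd, ban_resource and clear_ban are hypothetical helpers.
# Set DRY_RUN=1 to print the command instead of executing it.
CRM_RESOURCE="${CRM_RESOURCE:-crm_resource}"

run_cmd() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Ban resource $1 from node $2 (same options as in the quoted patch).
ban_resource() {
    run_cmd "$CRM_RESOURCE" -B -r "$1" -N "$2" -Q
}

# Remove the ban again once the administrator has fixed the node.
clear_ban() {
    run_cmd "$CRM_RESOURCE" --clear -r "$1" -N "$2"
}
```

With DRY_RUN=1, `ban_resource pgsql node1` only prints the command line, which makes it easy to check what would run before trying it on a live cluster.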
>> ----- Original Message -----
>>> From: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
>>> To: users at clusterlabs.org; kgaillot at redhat.com
>>> Cc:
>>> Date: 2017/2/6, Mon 17:44
>>> Subject: [ClusterLabs] Antw: Re: [Question] About a change of crm_failcount.
>>>
>>>>>> Ken Gaillot <kgaillot at redhat.com> wrote on 02.02.2017 at 19:33 in message
>>> <91a83571-9930-94fd-e635-96283067105c at redhat.com>:
>>>> On 02/02/2017 12:23 PM, renayama19661014 at ybb.ne.jp wrote:
>>>>> Hi All,
>>>>>
>>>>> Since the following fix, users can no longer set a value other than zero
>>>>> with crm_failcount.
>>>>>
>>>>> - [Fix: tools: implement crm_failcount command-line options correctly]
>>>>> - https://github.com/ClusterLabs/pacemaker/commit/95db10602e8f646eefed335414e40a994498cafd#diff-6e58482648938fd488a920b9902daac4
>>>>>
>>>>> However, pgsql RA sets INFINITY in a script.
>>>>>
>>>>> ```
>>>>> (snip)
>>>>> CRM_FAILCOUNT="${HA_SBIN_DIR}/crm_failcount"
>>>>> (snip)
>>>>> ocf_exit_reason "My data is newer than new master's one. New master's location : $master_baseline"
>>>>> exec_with_retry 0 $CRM_FAILCOUNT -r $OCF_RESOURCE_INSTANCE -U $NODENAME -v INFINITY
>>>>> return $OCF_ERR_GENERIC
>>>>> (snip)
>>>>> ```
>>>>>
>>>>> This seems to affect only the pgsql RA.
>>>>>
>>>>> Could you change crm_failcount so that a value other than zero can be set
>>>>> again? If that is not possible, we will modify the pgsql RA to use
>>>>> crm_attribute instead.
>>>>>
>>>>> Best Regards,
>>>>> Hideo Yamauchi.
>>>>
>>>> Hmm, I didn't realize that was used. I changed it because it's not a
>>>> good idea to set fail-count without also changing last-failure and
>>>> having a failed op in the LRM history. I'll have to think about what
>>>> the best alternative is.
>>>
>>> The question also is whether the RA can achieve the same effect otherwise. I
>>> thought CRM sets the failcount, not the RA...