[ClusterLabs] [Question] About a change of crm_failcount.

Fri Feb 3 11:02:34 EST 2017

On Fri, 3 Feb 2017 09:45:18 -0600
Ken Gaillot <kgaillot at redhat.com> wrote:

> On 02/02/2017 12:33 PM, Ken Gaillot wrote:
> > On 02/02/2017 12:23 PM, renayama19661014 at ybb.ne.jp wrote:  
> >> Hi All,
> >>
> >> Since the following fix, users can no longer set crm_failcount to any
> >> value other than zero.
> >>
> >>  - [Fix: tools: implement crm_failcount command-line options correctly]
> >>    -
> >> https://github.com/ClusterLabs/pacemaker/commit/95db10602e8f646eefed335414e40a994498cafd#diff-6e58482648938fd488a920b9902daac4
> >>
> >> However, the pgsql RA sets the fail count to INFINITY in its script.
> >>
> >> ```
> >> (snip)
> >>     CRM_FAILCOUNT="${HA_SBIN_DIR}/crm_failcount"
> >> (snip)
> >>     ocf_exit_reason "My data is newer than new master's one. New master's location : $master_baseline"
> >>     exec_with_retry 0 $CRM_FAILCOUNT -r $OCF_RESOURCE_INSTANCE -U $NODENAME -v INFINITY
> >>     return $OCF_ERR_GENERIC
> >> (snip)
> >> ```
> >>
> >> As far as I can tell, only pgsql is affected.
> >>
> >> Could you change crm_failcount so that it can again set a non-zero value?
> >> If that is not possible, we will modify the pgsql RA to use crm_attribute
> >> instead.
> >>
> >> Best Regards,
> >> Hideo Yamauchi.  
> > 
> > Hmm, I didn't realize that was used. I changed it because it's not a
> > good idea to set fail-count without also changing last-failure and
> > having a failed op in the LRM history. I'll have to think about what the
> > best alternative is.  
> 
> Having a resource agent modify its own fail count is not a good idea,
> and could lead to unpredictable behavior. I didn't realize the pgsql
> agent did that.
> 
> I don't want to re-enable the functionality, because I don't want to
> encourage more agents doing this.
> 
> There are two alternatives the pgsql agent can choose from:
> 
> 1. Return a "hard" error such as OCF_ERR_ARGS or OCF_ERR_PERM. When
> Pacemaker gets one of these errors from an agent, it will ban the
> resource from that node (until the failure is cleared).
> 
> 2. Use crm_resource --ban instead. This would ban the resource from that
> node until the user removes the ban with crm_resource --clear (or by
> deleting the ban constraint from the configuration).
> 
> I'd recommend #1 since it does not require any pacemaker-specific tools.
> 
> We can make sure resource-agents has a fix for this before we release a
> new version of Pacemaker. We'll have to publicize as much as possible to
> pgsql users that they should upgrade resource-agents before or at the
> same time as pacemaker. I see the alternative PAF agent has the same
> usage, so it will need to be updated, too.

Yes, I was following this conversation.

I'll do the fix on our side.

Thank you!
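
For reference, a rough sketch of what the two alternatives above might look like inside an agent. This is not the actual pgsql or PAF fix. The function names `reject_stale_data` and `ban_from_node` are hypothetical, the choice of OCF_ERR_PERM over OCF_ERR_ARGS is an assumption, and the fallback definitions are only there so the snippet runs outside a real agent, where ocf-shellfuncs would already provide them.

```shell
#!/bin/sh
# Fallbacks for running outside a real agent (assumption: in an agent these
# come from ocf-shellfuncs). The numbers are the standard OCF return codes.
: "${OCF_ERR_GENERIC:=1}"
: "${OCF_ERR_ARGS:=2}"
: "${OCF_ERR_PERM:=4}"

# Stand-in for ocf_exit_reason when ocf-shellfuncs has not been sourced.
if ! command -v ocf_exit_reason >/dev/null 2>&1; then
    ocf_exit_reason() { echo "exit reason: $*" >&2; }
fi

# Alternative 1 (hypothetical helper): report the reason and return a "hard"
# error instead of setting the fail count to INFINITY.
reject_stale_data() {
    master_baseline="$1"
    ocf_exit_reason "My data is newer than new master's one. New master's location : $master_baseline"
    return "$OCF_ERR_PERM"
}

# Alternative 2 (hypothetical helper): ban the resource from a node with
# crm_resource. The ban stays until someone runs crm_resource --clear or
# deletes the constraint. Defined here but not called.
ban_from_node() {
    resource="$1"
    node="$2"
    crm_resource --ban --resource "$resource" --node "$node"
}
```

In this sketch the caller would simply propagate the return code of `reject_stale_data` as the result of the action, so no Pacemaker-specific tool is needed for alternative 1.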