[Pacemaker] Fail-count and failure timeout

Holger.Teutsch at fresenius-netcare.com
Tue Oct 5 06:59:32 EDT 2010


The resource failed whenever the sleep expired, i.e. every 600 secs.
Now I changed the resource to

sleep 7200, failure-timeout 3600

i.e. to values far beyond the cluster-recheck-interval of 15m.

Now everything behaves as expected.
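
In CIB terms the change amounts to roughly the following (illustrative
snippet; the nvpair ids are those of the test resource in the
configuration quoted below, only the values differ):

        <nvpair id="test-meta_attributes-failure-timeout"
                name="failure-timeout" value="3600"/>
        ...
        <nvpair id="test-instance_attributes-binfile" name="binfile"
                value="sleep 7200"/>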
 
Kind regards

Holger Teutsch 





From:   Andrew Beekhof <andrew at beekhof.net>
To:     The Pacemaker cluster resource manager 
<pacemaker at oss.clusterlabs.org>
Date:   05.10.2010 11:09
Subject:        Re: [Pacemaker] Fail-count and failure timeout



On Tue, Oct 5, 2010 at 11:07 AM, Andrew Beekhof <andrew at beekhof.net> wrote:
> On Fri, Oct 1, 2010 at 3:40 PM, <Holger.Teutsch at fresenius-netcare.com> wrote:
>> Hi,
>> I observed the following in pacemaker versions 1.1.3 and tip up to patch 10258.
>>
>> In a small test environment to study fail-count behavior I have one resource
>>
>> anything
>> doing sleep 600 with a monitoring interval of 10 secs.
>>
>> The failure-timeout is 300.
>>
>> I would expect to never see a failcount higher than 1.
>
> Why?
>
> The fail-count is only reset when the PE runs... which is on a failure
> and/or after the cluster-recheck-interval.
> So I'd expect a maximum of two.

Actually this is wrong.
There is no maximum, because the fail-count is only expired if, at the
time the PE runs, at least 300s have passed since the last failure.
And since the PE only runs when the resource fails, the count is never
reset.
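
(For illustration, assuming the monitor reports the failure as soon as
the sleep exits: the resource fails at roughly t=600, 1200, 1800, ...
Each failure triggers a PE run, but at that moment essentially no time
has passed since the most recent failure, far less than the 300s
failure-timeout, so nothing expires. The 15-minute rechecks only clear
the count if they happen to fall more than 300s after the latest
failure, which with a ~600s failure period plus restart time is only
occasionally the case. That would explain the sporadic clears and the
otherwise steadily growing count.)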

>
>       cluster-recheck-interval = time [15min]
>              Polling interval for time based changes to options,
>              resource parameters and constraints.
>
>              The Cluster is primarily event driven, however the
>              configuration can have elements that change based on time.
>              To ensure these changes take effect, we can optionally poll
>              the cluster’s status for changes. Allowed values: Zero
>              disables polling. Positive values are an interval in
>              seconds (unless other SI units are specified, e.g. 5min)
>
>
>
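
For example, shortening the interval would be a matter of adding
something like the following nvpair to cib-bootstrap-options
(illustrative; the id merely follows the naming convention of the
configuration below):

        <nvpair id="cib-bootstrap-options-cluster-recheck-interval"
                name="cluster-recheck-interval" value="5min"/>
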
>>
>> I observed some sporadic clears but mostly the count is increasing by 1 every 10 minutes.
>>
>> Am I mistaken or is this a bug?
>
> Hard to say without logs.  What value did it reach?
>
>>
>> Regards
>> Holger
>>
>> -- complete cib for reference ---
>>
>> <cib epoch="32" num_updates="0" admin_epoch="0"
>> validate-with="pacemaker-1.2" crm_feature_set="3.0.4" have-quorum="0"
>> cib-last-written="Fri Oct  1 14:17:31 2010" dc-uuid="hotlx">
>>   <configuration>
>>     <crm_config>
>>       <cluster_property_set id="cib-bootstrap-options">
>>         <nvpair id="cib-bootstrap-options-dc-version" name="dc-version"
>> value="1.1.3-09640bd6069e677d5eed65203a6056d9bf562e67"/>
>>         <nvpair id="cib-bootstrap-options-cluster-infrastructure"
>> name="cluster-infrastructure" value="openais"/>
>>         <nvpair id="cib-bootstrap-options-expected-quorum-votes"
>> name="expected-quorum-votes" value="2"/>
>>         <nvpair id="cib-bootstrap-options-no-quorum-policy"
>> name="no-quorum-policy" value="ignore"/>
>>         <nvpair id="cib-bootstrap-options-stonith-enabled"
>> name="stonith-enabled" value="false"/>
>>         <nvpair id="cib-bootstrap-options-start-failure-is-fatal"
>> name="start-failure-is-fatal" value="false"/>
>>         <nvpair id="cib-bootstrap-options-last-lrm-refresh"
>> name="last-lrm-refresh" value="1285926879"/>
>>       </cluster_property_set>
>>     </crm_config>
>>     <nodes>
>>       <node id="hotlx" uname="hotlx" type="normal"/>
>>     </nodes>
>>     <resources>
>>       <primitive class="ocf" id="test" provider="heartbeat" type="anything">
>>         <meta_attributes id="test-meta_attributes">
>>           <nvpair id="test-meta_attributes-target-role" name="target-role"
>> value="started"/>
>>           <nvpair id="test-meta_attributes-failure-timeout"
>> name="failure-timeout" value="300"/>
>>         </meta_attributes>
>>         <operations id="test-operations">
>>           <op id="test-op-monitor-10" interval="10" name="monitor"
>> on-fail="restart" timeout="20s"/>
>>           <op id="test-op-start-0" interval="0" name="start"
>> on-fail="restart" timeout="20s"/>
>>         </operations>
>>         <instance_attributes id="test-instance_attributes">
>>           <nvpair id="test-instance_attributes-binfile" name="binfile"
>> value="sleep 600"/>
>>         </instance_attributes>
>>       </primitive>
>>     </resources>
>>     <constraints/>
>>   </configuration>
>> </cib>
>>
>

_______________________________________________
Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

