[Pacemaker] Fail-count and failure timeout

Tue Oct 5 05:07:14 EDT 2010

On Fri, Oct 1, 2010 at 3:40 PM,  <Holger.Teutsch at fresenius-netcare.com> wrote:
> Hi,
> I observed the following in pacemaker Versions 1.1.3 and tip up to patch
> 10258.
>
> In a small test environment to study fail-count behavior I have one resource
> (anything) doing sleep 600 with monitoring interval 10 secs.
>
> The failure-timeout is 300.
>
> I would expect to never see a failcount higher than 1.

Why?

The fail-count is only reset when the PE (policy engine) runs, which
happens on a failure and/or after the cluster-recheck-interval.
So I'd expect a maximum of two.

       cluster-recheck-interval = time [15min]
              Polling interval for time based changes to options,
              resource parameters and constraints.

              The Cluster is primarily event driven, however the
              configuration can have elements that change based on time.
              To ensure these changes take effect, we can optionally poll
              the cluster’s status for changes. Allowed values: Zero
              disables polling. Positive values are an interval in seconds
              (unless other SI units are specified, e.g. 5min).
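
For illustration only, a sketch of how this option could be set in the
crm_config section of a CIB like the one quoted below, so that the PE runs
more often than the 15min default. The nvpair id and the 5min value are
assumptions picked for this example; they are not part of the posted
configuration, and whether a shorter interval is appropriate depends on the
cluster.

```xml
<crm_config>
  <cluster_property_set id="cib-bootstrap-options">
    <!-- Hypothetical addition: poll for time-based changes every 5 minutes.
         The id below is illustrative and only needs to be unique in the CIB. -->
    <nvpair id="cib-bootstrap-options-cluster-recheck-interval"
            name="cluster-recheck-interval" value="5min"/>
  </cluster_property_set>
</crm_config>
```

With the default of 15min, an expired failure may not be noticed until the
next failure or the next recheck, whichever comes first.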

>
> I observed some sporadic clears, but mostly the count is increasing by 1 every
> 10 minutes.
>
> Am I mistaken or is this a bug ?

Hard to say without logs.  What value did it reach?

>
> Regards
> Holger
>
> -- complete cib for reference ---
>
> <cib epoch="32" num_updates="0" admin_epoch="0"
> validate-with="pacemaker-1.2" crm_feature_set="3.0.4" have-quorum="0"
> cib-last-written="Fri Oct  1 14:17:31 2010" dc-uuid="hotlx">
>   <configuration>
>     <crm_config>
>       <cluster_property_set id="cib-bootstrap-options">
>         <nvpair id="cib-bootstrap-options-dc-version" name="dc-version"
> value="1.1.3-09640bd6069e677d5eed65203a6056d9bf562e67"/>
>         <nvpair id="cib-bootstrap-options-cluster-infrastructure"
> name="cluster-infrastructure" value="openais"/>
>         <nvpair id="cib-bootstrap-options-expected-quorum-votes"
> name="expected-quorum-votes" value="2"/>
>         <nvpair id="cib-bootstrap-options-no-quorum-policy"
> name="no-quorum-policy" value="ignore"/>
>         <nvpair id="cib-bootstrap-options-stonith-enabled"
> name="stonith-enabled" value="false"/>
>         <nvpair id="cib-bootstrap-options-start-failure-is-fatal"
> name="start-failure-is-fatal" value="false"/>
>         <nvpair id="cib-bootstrap-options-last-lrm-refresh"
> name="last-lrm-refresh" value="1285926879"/>
>       </cluster_property_set>
>     </crm_config>
>     <nodes>
>       <node id="hotlx" uname="hotlx" type="normal"/>
>     </nodes>
>     <resources>
>       <primitive class="ocf" id="test" provider="heartbeat" type="anything">
>         <meta_attributes id="test-meta_attributes">
>           <nvpair id="test-meta_attributes-target-role" name="target-role"
> value="started"/>
>           <nvpair id="test-meta_attributes-failure-timeout"
> name="failure-timeout" value="300"/>
>         </meta_attributes>
>         <operations id="test-operations">
>           <op id="test-op-monitor-10" interval="10" name="monitor"
> on-fail="restart" timeout="20s"/>
>           <op id="test-op-start-0" interval="0" name="start"
> on-fail="restart" timeout="20s"/>
>         </operations>
>         <instance_attributes id="test-instance_attributes">
>           <nvpair id="test-instance_attributes-binfile" name="binfile"
> value="sleep 600"/>
>         </instance_attributes>
>       </primitive>
>     </resources>
>     <constraints/>
>   </configuration>
> </cib>