[Pacemaker] Expired fail-count doesn't get cleaned up.

Mon Aug 13 20:01:49 EDT 2012

On Tue, Jul 31, 2012 at 7:36 PM, David Coulson <david at davidcoulson.net> wrote:
> I'm running RHEL6 with the tech preview of pacemaker it ships with. I've a
> number of resources which have a failure-timeout="60", which most of the
> time does what it is supposed to.
>
> Last night a resource that is part of a clone failed. The resource
> recovered, but its fail-count was never cleaned up, and about once a second
> the DC logged the pengine message below. I manually ran a resource cleanup,
> and it seems happy now. Is there something I should be looking for in the
> logs to indicate that it 'missed' expiring this?

You might be experiencing:
+ David Vossel (5 months ago) 9263480: Low: pengine: cl#5025 - Automatically clear failures when resource configuration changes.

But if you send us a crm_report tarball covering the period during which
you had problems, we can check.

>
> Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
>
> Migration summary:
> * Node dresproddns01:
>    re-openfire-lsb:0: migration-threshold=1000000 fail-count=1 last-failure='Mon Jul 30 21:57:53 2012'
> * Node dresproddns02:
>
>
> Jul 31 05:32:34 dresproddns02 pengine: [2860]: notice: get_failcount: Failcount for cl-openfire on dresproddns01 has expired (limit was 60s)
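
For reference, the arithmetic behind that "has expired" message can be
sketched as below. This is only an illustration of the comparison, not
Pacemaker's actual code: the function name failcount_expired is
hypothetical, the >= comparison is an assumption, and the timestamps are
copied from the migration summary and the log line above.

```python
from datetime import datetime, timedelta


def failcount_expired(last_failure, now, failure_timeout_s):
    """Hypothetical helper: True once the last failure is at least
    failure_timeout_s seconds old (the >= comparison is an assumption)."""
    return (now - last_failure) >= timedelta(seconds=failure_timeout_s)


# last-failure from the migration summary: Mon Jul 30 21:57:53 2012
last_failure = datetime(2012, 7, 30, 21, 57, 53)
# timestamp of the pengine log line: Jul 31 05:32:34
log_time = datetime(2012, 7, 31, 5, 32, 34)

# Age of the failure, in seconds, at the time the message was logged.
elapsed = (log_time - last_failure).total_seconds()
expired = failcount_expired(last_failure, log_time, 60)
print(elapsed, expired)
```

With these values the failure is a little over seven and a half hours old,
far beyond the 60s limit, which matches the report that the fail-count was
recognised as expired but still listed in the migration summary.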