[Pacemaker] failcount not being reset?

Mon Jan 31 16:13:33 EST 2011

On Mon, Jan 31, 2011 at 10:12 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
> On Mon, Jan 31, 2011 at 4:51 PM, Anton Altaparmakov <aia21 at cam.ac.uk> wrote:
>> Hi,
>>
>> After a monitor action failure, the failcount is not being reset even though I have configured everything I am aware of, i.e. I have set (copied from "crm configure show"):
>
> That's a 1.1 feature

In 1.0, failures are ignored after the timeout but the failcount is not
reset (so the next failure will put you back over the limit).
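
To make the difference concrete, here is a toy model in Python. It is only a sketch of the behaviour described above, not Pacemaker's actual code: the class, its fields and the reset_on_expiry flag are all made up for illustration, and the threshold of 3 in the usage note is an arbitrary small number (the configuration quoted below uses migration-threshold=1000000).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureTracker:
    """Toy model of failcount expiry (hypothetical, not Pacemaker code)."""
    failure_timeout: float        # seconds, like failure-timeout="60s"
    migration_threshold: int      # failures allowed before the node is avoided
    reset_on_expiry: bool         # False ~ 1.0 as described above, True ~ 1.1
    fail_count: int = 0           # the stored counter (what crm_mon would show)
    last_failure: Optional[float] = None

    def _expired(self, now: float) -> bool:
        # A failure record is stale once failure_timeout has passed.
        return (self.last_failure is not None
                and now - self.last_failure >= self.failure_timeout)

    def effective_count(self, now: float) -> int:
        # The count used for placement decisions at time `now`.
        if self._expired(now):
            if self.reset_on_expiry:
                # 1.1-style: the stored counter is actually cleared.
                self.fail_count = 0
                self.last_failure = None
            # In both variants a stale count is ignored.
            return 0
        return self.fail_count

    def record_failure(self, now: float) -> None:
        # Give the resetting variant a chance to clear a stale counter first.
        self.effective_count(now)
        self.fail_count += 1
        self.last_failure = now

    def over_limit(self, now: float) -> bool:
        return self.effective_count(now) >= self.migration_threshold
```

With failure_timeout=60 and migration_threshold=3, three failures followed by a long quiet period leave the non-resetting tracker with a stored count of 3 that is merely ignored, so a single new failure takes it to 4 and straight back over the limit. The resetting tracker starts again from 1.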

>
>> property \
>>        cluster-recheck-interval="60s"
>> rsc_defaults $id="rsc-options" \
>>        failure-timeout="60s"
>>
>> Yet "crm_mon --failcounts" shows:
>>
>> * Node nessie:
>>   res_drbd:0: migration-threshold=1000000 fail-count=2 last-failure='Mon Jan 31 14:27:14 2011'
>>
>> However the logs say that:
>>
>> Jan 31 15:41:27 nessie pengine: [1070]: info: get_failcount: ms_drbd has failed 2 times on nessie
>> Jan 31 15:41:27 nessie pengine: [1070]: notice: get_failcount: Failcount for ms_drbd on nessie has expired (limit was 60s)
>>
>> So why does fail-count not go back to zero and disappear?  Am I doing something wrong?  Is it broken?  Am I missing some option?
>>
>> Note this is running on Ubuntu 10.04.1 LTS and the relevant packages are:
>>
>> pacemaker 1.0.8+hg15494-2ubuntu2
>> corosync 1.2.0-0ubuntu1
>> drbd8-utils 2:8.3.7-1ubuntu2.1
>>
>> And here is the full configuration (crm configure show):
>>
>> node hydra
>> node nessie
>> node qs1
>> primitive res_drbd ocf:linbit:drbd \
>>        params drbd_resource="dev-vmstore" \
>>        meta target-role="Started" \
>>        op monitor interval="9s" role="Master" on-fail="restart" \
>>        op monitor interval="10s" role="Slave" on-fail="restart"
>> primitive res_filesystem ocf:heartbeat:Filesystem \
>>        params fstype="xfs" device="/dev/drbd0" directory="/dev-vmstore" options="noatime,barrier,largeio,logbufs=8,logbsize=256k,swalloc" \
>>        meta target-role="Started" \
>>        op monitor on-fail="restart" interval="10s"
>> primitive res_ip ocf:heartbeat:IPaddr2 \
>>        params ip="172.28.208.19" cidr_netmask="24" broadcast="172.28.208.255" \
>>        meta target-role="Started" \
>>        op monitor on-fail="restart" interval="10s"
>> primitive res_nfs_server lsb:nfs-kernel-server \
>>        meta target-role="Started" \
>>        op monitor on-fail="restart" interval="10s"
>> group group_dev-vmstore res_filesystem res_nfs_server res_ip
>> ms ms_drbd res_drbd \
>>        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" globally_unique="false"
>> location loc_dev-vmstore_hydra group_dev-vmstore 0: hydra
>> location loc_dev-vmstore_nessie group_dev-vmstore 0: nessie
>> location loc_drbd_hydra ms_drbd 0: hydra
>> location loc_drbd_nessie ms_drbd 0: nessie
>> colocation col_dev-vmstore inf: group_dev-vmstore ms_drbd:Master
>> order order_dev-vmstore inf: ms_drbd:promote group_dev-vmstore:start
>> property $id="cib-bootstrap-options" \
>>        dc-version="1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd" \
>>        cluster-infrastructure="openais" \
>>        expected-quorum-votes="3" \
>>        stonith-enabled="false" \
>>        no-quorum-policy="stop" \
>>        symmetric-cluster="false" \
>>        pe-error-series-max="100" \
>>        pe-warn-series-max="100" \
>>        pe-input-series-max="100" \
>>        cluster-delay="10s" \
>>        last-lrm-refresh="1296433757" \
>>        cluster-recheck-interval="60s"
>> rsc_defaults $id="rsc-options" \
>>        failure-timeout="60s"
>> op_defaults $id="op_defaults-options" \
>>        timeout="5s"
>>
>> Thanks a lot in advance for any help!
>>
>> Best regards,
>>
>>        Anton
>> --
>> Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
>> Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
>> Linux NTFS maintainer, http://www.linux-ntfs.org/