[ClusterLabs] Failcount not resetting to zero after failure-timeout

Pritam Kharat pritam.kharat at oneconvergence.com
Mon Nov 23 05:36:09 UTC 2015


Could someone please reply?

On Thu, Nov 19, 2015 at 10:28 PM, Pritam Kharat <
pritam.kharat at oneconvergence.com> wrote:

>
> Hi All,
>
> I have a 2-node HA setup. I have added migration-threshold=5 and
> failure-timeout=120s for my resources. When the migration threshold of 5 is
> reached, the resources are migrated to the other node. But on one occasion I
> observed that the fail-count was not reset to zero after 2 minutes. The setup
> stayed in the same state for almost 3 hours, and the fail-count still did not
> reset to zero.
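>
> For reference, the resources are configured roughly like this (a minimal
> sketch assuming the crm shell is used; the monitor operation and its
> interval are assumptions, only the meta attributes are the actual settings):
>
>   crm configure primitive oc-service-manager upstart:oc-service-manager \
>     op monitor interval=30s \
>     meta migration-threshold=5 failure-timeout=120s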
>
> Then I tried the same test again but could not reproduce it. When I compared
> the logs of the success scenario with those of the failed scenario, I found
> that the pengine did not take any action to clear the failcount.
>
>
>
> Success logs
> *Nov 19 15:27:08 [16409] sc-node-1    pengine:   notice: unpack_rsc_op:
>  Clearing expired failcount for oc-service-manager on sc-node-1*
> Nov 19 15:27:08 [16409] sc-node-1    pengine:     info: get_failcount_full:
>     oc-service-manager has failed 5 times on sc-node-1
> Nov 19 15:27:08 [16409] sc-node-1    pengine:   notice: unpack_rsc_op:
>  Clearing expired failcount for oc-service-manager on sc-node-1
> Nov 19 15:27:08 [16409] sc-node-1    pengine:   notice: unpack_rsc_op:
>  Re-initiated expired calculated failure oc-service-manager_last_failure_0
> (rc=7, magic=0:7;3:145:0:258ae879-832f-4126-a7d7-e57bd3fdcdb1) on
> sc-node-1
>
>
> Failure logs
> Nov 04 22:23:39 [6831] sc-HA2    pengine:  warning: unpack_rsc_op:
>  Processing failed op monitor for oc-service-manager on sc-HA1: not
> running (7)
> Nov 04 22:23:39 [6831] sc-HA2    pengine:     info: native_print:
> oc-service-manager      (upstart:oc-service-manager):   Started sc-HA2
> *Nov 04 22:23:39 [6831] sc-HA2    pengine:     info: get_failcount_full:
>       oc-service-manager has failed 5 times on sc-HA1*
> Nov 04 22:23:39 [6831] sc-HA2    pengine:  warning: common_apply_stickiness:
>    Forcing oc-service-manager away from sc-HA1 after 5 failures (max=5)
> Nov 04 22:23:39 [6831] sc-HA2    pengine:     info: rsc_merge_weights:
>  oc-service-manager: Rolling back scores from oc-fw-agent
> Nov 04 22:23:39 [6831] sc-HA2    pengine:     info: LogActions:
> Leave   oc-service-manager      (Started sc-HA2)
>
>
> What might be the reason that, in the failure case, this action did not take
> place?
> *notice: unpack_rsc_op:  Clearing expired failcount for
> oc-service-manager *
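>
> For reference, the fail counts can also be inspected and cleared by hand
> (resource and node names below are just the ones from the failure logs above;
> exact options may differ slightly between Pacemaker versions):
>
>   crm_mon --failcounts
>   crm_resource --cleanup --resource oc-service-manager --node sc-HA1
>
> but I would expect the failure-timeout to take care of this automatically.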
>
>
> --
> Thanks and Regards,
> Pritam Kharat.
>



-- 
Thanks and Regards,
Pritam Kharat.