[ClusterLabs] Failcount not resetting to zero after failure-timeout

Pritam Kharat pritam.kharat at oneconvergence.com
Thu Nov 19 16:58:19 UTC 2015


Hi All,

I have a 2-node HA setup. I have set migration_threshold=5 and
failure-timeout=120s on my resources. When the fail-count reaches the
migration threshold of 5, the resources are migrated to the other node. On
one occasion, however, the fail-count was not reset to zero after 2 minutes.
The setup stayed in that state for almost 3 hours, and the fail-count still
did not reset to zero.

I then ran the same test again but could not reproduce the problem. Comparing
the logs of the successful run with those of the failed run, I found that in
the failed run pengine took no action to clear the failcount.



Success logs
*Nov 19 15:27:08 [16409] sc-node-1    pengine:   notice: unpack_rsc_op: Clearing expired failcount for oc-service-manager on sc-node-1*
Nov 19 15:27:08 [16409] sc-node-1    pengine:     info: get_failcount_full: oc-service-manager has failed 5 times on sc-node-1
Nov 19 15:27:08 [16409] sc-node-1    pengine:   notice: unpack_rsc_op: Clearing expired failcount for oc-service-manager on sc-node-1
Nov 19 15:27:08 [16409] sc-node-1    pengine:   notice: unpack_rsc_op: Re-initiated expired calculated failure oc-service-manager_last_failure_0 (rc=7, magic=0:7;3:145:0:258ae879-832f-4126-a7d7-e57bd3fdcdb1) on sc-node-1


Failure logs
Nov 04 22:23:39 [6831] sc-HA2    pengine:  warning: unpack_rsc_op: Processing failed op monitor for oc-service-manager on sc-HA1: not running (7)
Nov 04 22:23:39 [6831] sc-HA2    pengine:     info: native_print: oc-service-manager      (upstart:oc-service-manager):   Started sc-HA2
*Nov 04 22:23:39 [6831] sc-HA2    pengine:     info: get_failcount_full: oc-service-manager has failed 5 times on sc-HA1*
Nov 04 22:23:39 [6831] sc-HA2    pengine:  warning: common_apply_stickiness: Forcing oc-service-manager away from sc-HA1 after 5 failures (max=5)
Nov 04 22:23:39 [6831] sc-HA2    pengine:     info: rsc_merge_weights: oc-service-manager: Rolling back scores from oc-fw-agent
Nov 04 22:23:39 [6831] sc-HA2    pengine:     info: LogActions: Leave   oc-service-manager      (Started sc-HA2)


What might be the reason that, in the failure case, the following action did
not take place?
*notice: unpack_rsc_op:  Clearing expired failcount for oc-service-manager *
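
For anyone wanting to repeat the comparison on their own logs, here is a
minimal sketch in Python. It only looks for the two pengine messages quoted
above ("has failed N times on" and "Clearing expired failcount for") and
reports, per resource and node, whether a clearing message was seen. The
function name failcount_events and the regular expressions are my own
assumptions based on the message text shown in this mail; they are not part
of Pacemaker, and other versions may word the messages differently.

```python
import re

# Patterns derived from the pengine messages quoted in this mail (assumed format).
CLEARED = re.compile(r"Clearing expired failcount for (\S+) on (\S+)")
FAILED = re.compile(r"(\S+) has failed (\d+) times on (\S+)")


def failcount_events(log_text):
    """Return {(resource, node): {"failures": int, "cleared": bool}}.

    Whitespace is collapsed first so that log entries wrapped across
    several lines (as in an e-mail) still match. Asterisks used as
    emphasis markers are dropped.
    """
    flat = " ".join(log_text.replace("*", " ").split())
    cleared = set(CLEARED.findall(flat))
    result = {}
    for resource, count, node in FAILED.findall(flat):
        result[(resource, node)] = {
            "failures": int(count),
            "cleared": (resource, node) in cleared,
        }
    return result


if __name__ == "__main__":
    import sys

    # Usage: python failcount_events.py < pengine-excerpt.log
    for (resource, node), info in failcount_events(sys.stdin.read()).items():
        state = "cleared" if info["cleared"] else "NOT cleared"
        print(f"{resource} on {node}: {info['failures']} failures, {state}")
```

Run against the two excerpts above, this should flag sc-node-1 as cleared and
sc-HA1 as not cleared, which is the difference I am asking about.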


-- 
Thanks and Regards,
Pritam Kharat.