[ClusterLabs] failcount is not getting reset after failure_timeout if monitoring is disabled

Tue May 23 15:00:33 CEST 2017

Hi,

We are running a two-node cluster (active node X, passive node Y) with
multiple resources of type IPaddr2.
Running monitor operations for all of these IPaddr2 resources is hogging
the CPU, because we have configured a very low monitor interval (200 ms).

To avoid this problem, we are trying to use netlink notifications to
monitor the floating IPs, and we update the fail count of the corresponding
IPaddr2 resource with crm_failcount. Along with this, we have disabled the
IPaddr2 monitor operation.
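For illustration, a minimal Python sketch of the netlink side, assuming a Linux host and IPv4 floating addresses: it subscribes to RTMGRP_IPV4_IFADDR address notifications and calls a callback whenever one of the watched addresses is deleted. The function names, the floating_ips set and the on_removed callback are hypothetical. How the callback updates the fail count with crm_failcount is deliberately left out, because the available options differ between Pacemaker versions (check crm_failcount --help on your version).

```python
import socket
import struct

# rtnetlink constants (from <linux/rtnetlink.h> and <linux/if_addr.h>)
RTM_NEWADDR = 20
RTM_DELADDR = 21
IFA_ADDRESS = 1
IFA_LOCAL = 2
RTMGRP_IPV4_IFADDR = 0x10

NLMSGHDR = struct.Struct("=IHHII")   # nlmsg_len, type, flags, seq, pid
IFADDRMSG = struct.Struct("=BBBBI")  # family, prefixlen, flags, scope, ifindex
RTATTR = struct.Struct("=HH")        # rta_len, rta_type


def _align4(n):
    """Netlink messages and attributes are padded to 4-byte boundaries."""
    return (n + 3) & ~3


def parse_addr_events(buf):
    """Yield (msg_type, ifindex, ipv4_address) for each address message in buf."""
    offset = 0
    while offset + NLMSGHDR.size <= len(buf):
        msg_len, msg_type, _flags, _seq, _pid = NLMSGHDR.unpack_from(buf, offset)
        if msg_len < NLMSGHDR.size:
            break  # malformed header, stop parsing
        if msg_type in (RTM_NEWADDR, RTM_DELADDR):
            body = offset + NLMSGHDR.size
            family, _plen, _fl, _scope, ifindex = IFADDRMSG.unpack_from(buf, body)
            attr = body + IFADDRMSG.size
            end = offset + msg_len
            addr = None
            while attr + RTATTR.size <= end:
                rta_len, rta_type = RTATTR.unpack_from(buf, attr)
                if rta_len < RTATTR.size:
                    break
                data = buf[attr + RTATTR.size:attr + rta_len]
                if family == socket.AF_INET and len(data) == 4:
                    # Prefer IFA_LOCAL; fall back to IFA_ADDRESS if it is absent.
                    if rta_type == IFA_LOCAL or (rta_type == IFA_ADDRESS and addr is None):
                        addr = socket.inet_ntoa(data)
                attr += _align4(rta_len)
            yield msg_type, ifindex, addr
        offset += _align4(msg_len)


def watch(floating_ips, on_removed):
    """Block forever, calling on_removed(addr, ifindex) when a watched IP disappears.

    on_removed is a hypothetical hook; in our setup it would bump the fail
    count of the matching IPaddr2 resource via crm_failcount.
    """
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE)
    sock.bind((0, RTMGRP_IPV4_IFADDR))  # (pid 0 = kernel assigns, multicast groups)
    while True:
        data = sock.recv(65536)
        for msg_type, ifindex, addr in parse_addr_events(data):
            if msg_type == RTM_DELADDR and addr in floating_ips:
                on_removed(addr, ifindex)
```

Usage would be something like watch({"192.0.2.10"}, handler), run as a long-lived process on each node; parsing is kept separate from the socket loop so it can be tested without root privileges.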

Things work fine up to this point: the IPaddr2 resource migrates to the
other node (Y) once its fail count reaches the migration threshold (1), and
Y becomes active because of the resource colocation constraints.

We have configured a failure timeout of 3 s and expected it to clear the
fail count on the initially active node (X).
The problem is that the fail count never gets reset on X, so the cluster
cannot move back to X.
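The behaviour we expected can be written down as a toy model. This is a simplified sketch, not Pacemaker's scheduler code: it assumes a failure stops counting once failure-timeout seconds have passed since a recorded last-failure time, and that a resource may run on a node while the effective fail count is below migration-threshold. The names FailState, effective_fail_count and allowed_on_node are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailState:
    fail_count: int = 0
    last_failure: Optional[float] = None  # epoch seconds of the most recent failure


def effective_fail_count(state, now, failure_timeout):
    """Fail count as this simplified model sees it at time `now`."""
    if state.fail_count == 0:
        return 0
    if (failure_timeout > 0 and state.last_failure is not None
            and now - state.last_failure >= failure_timeout):
        return 0  # the failure is older than failure-timeout, so it has expired
    # Model assumption: with no recorded failure time, nothing can expire.
    return state.fail_count


def allowed_on_node(state, now, failure_timeout, migration_threshold):
    """True if the resource may run on this node under the model."""
    return effective_fail_count(state, now, failure_timeout) < migration_threshold
```

With fail_count=1, migration_threshold=1 and failure_timeout=3, node X is ineligible right after the failure and eligible again 3 s later, which is what we expected to see.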

However, if we enable monitoring, everything works fine: the fail count
gets reset and the resources can fail back.

Regards,
Ashutosh T