[ClusterLabs] Resource failure-timeout does not reset when resource fails to connect to both nodes

Sam Gardner SGardner at trustwave.com
Mon Mar 28 12:44:30 EDT 2016

I have a simple resource defined:

[root@ha-d1 ~]# pcs resource show dmz1
 Resource: dmz1 (class=ocf provider=internal type=ip-address)
  Attributes: address= monitor_link=true
  Meta Attrs: migration-threshold=3 failure-timeout=30s
  Operations: monitor interval=7s (dmz1-monitor-interval-7s)

This is a custom resource that adds an Ethernet alias to one of the interfaces on our system.

I can unplug the cable on either node and failover occurs as expected. 30s after re-plugging it, I can repeat the exercise on the opposite node and failover again happens as expected.

However, if I unplug the cable from both nodes, the failcount goes up on each node and the 30s failure-timeout does not reset the failcounts, so Pacemaker never tries to start the failed resource again.

Full list of resources:

 Resource Group: network
     inif       (ocf::internal:ip.sh):       Started ha-d1.dev.com
     outif      (ocf::internal:ip.sh):       Started ha-d2.dev.com
     dmz1       (ocf::internal:ip.sh):       Stopped
 Master/Slave Set: DRBDMaster [DRBDSlave]
     Masters: [ ha-d1.dev.com ]
     Slaves: [ ha-d2.dev.com ]
 Resource Group: filesystem
     DRBDFS     (ocf::heartbeat:Filesystem):    Stopped
 Resource Group: application
     service_failover   (ocf::internal:service_failover):    Stopped

Failcounts for dmz1
 ha-d1.dev.com: 4
 ha-d2.dev.com: 4
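
For reference, clearing the failures by hand would presumably look something like the sketch below. It assumes the stock `pcs resource failcount show` and `pcs resource cleanup` subcommands; the `run` and `recover` helpers and the `DRY_RUN` switch are hypothetical names used only for illustration. By default the script just prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical helper, not part of our setup: prints (or runs) the manual
# recovery steps for a resource whose failcount has reached
# migration-threshold on every node.
RESOURCE="${RESOURCE:-dmz1}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

recover() {
    # Show the per-node failcounts for the resource.
    run pcs resource failcount show "$1"
    # Clear the failure history so the cluster may attempt a start again.
    run pcs resource cleanup "$1"
}

recover "$RESOURCE"
```

Running it with `DRY_RUN=0` on a cluster node would execute the two commands for real. This is exactly the manual step I would like to avoid.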

Is there any way to automatically recover from this scenario, other than setting an obnoxiously high migration-threshold?

Sam Gardner
Software Engineer

