[ClusterLabs] number of attempts

Fri May 4 11:41:49 EDT 2018

On Fri, 2018-05-04 at 15:46 +0200, alessandro.parodi at softeco.it wrote:
> Hi, I have a problem with a Pacemaker (0.9.158) / Corosync (Corosync
> Cluster Engine 2.4.0) cluster made up of two servers (Oracle Cloud)
> running Oracle Linux Server 7.4.
> On one of the two nodes (for example node1), a service seems to fail
> so many times that it exhausts its counter of attempts.
> At that point the service is, correctly, started on the other node
> (node2).
> If the service then has to change servers again (for example because
> node2 is shut down), Pacemaker does not try to restart it on node1.
> Apparently it does not reset the number of failed attempts.
> The situation is restored only after a cleanup (pcs resource
> cleanup).
> Is there a solution? Is it possible to tell Pacemaker to reset the
> number of failed attempts when, for example, the resource is started
> on the other node?
> 
> Thanks, alex

You can clean failures manually, or set the failure-timeout resource
meta-attribute (which can be set on a particular resource, or for all
resources via rsc_defaults). The failure-timeout (as you might expect)
works by automatically cleaning the failure after a certain amount of
time has passed, not when a particular event occurs (such as a start on
another node).
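
As a rough sketch of the two ways to set failure-timeout with pcs: the
resource name "my_service" and the 600s value are placeholders, not
taken from the original poster's configuration. The script is a dry run
by default (it only prints the commands); the PCS variable and the two
helper functions are just scaffolding for this example. The
"resource defaults" form shown is the pcs 0.9 syntax; newer pcs
releases expect "pcs resource defaults update ..." instead.

```shell
#!/bin/sh
# Dry run by default: each command is printed instead of executed.
# Export PCS=pcs to run the commands against a real cluster.
PCS="${PCS:-echo pcs}"

# Expire recorded failures of a single resource after a given time.
set_failure_timeout() {
    # $1 = resource name, $2 = timeout (for example 600s)
    $PCS resource meta "$1" failure-timeout="$2"
}

# Same setting, but as a default for all resources (rsc_defaults).
set_default_failure_timeout() {
    # $1 = timeout
    $PCS resource defaults failure-timeout="$1"
}

# "my_service" and 600s are placeholder values for illustration.
set_failure_timeout my_service 600s
set_default_failure_timeout 600s
```

Note that, as far as I know, an expired failure is only acted on when
the cluster next re-evaluates its state, so the effective delay can be
longer than the configured value (see cluster-recheck-interval).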

Once a failure is cleaned, that node becomes eligible to run the
resource again, and (depending on stickiness and so forth) the cluster
may choose to move the resource back to that node. That's one reason
failures aren't automatically cleaned after a successful start
elsewhere. Also, keeping the failure allows an administrator to notice
that something went wrong, and manually investigate before allowing the
node to host the resource again.
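
For the manual route, a minimal sketch of checking what was recorded
before clearing it might look like the following. Again "my_service" is
a placeholder, the helper functions and the PCS variable exist only for
this example, and the script prints the commands rather than running
them unless PCS=pcs is exported.

```shell
#!/bin/sh
# Dry run by default: each command is printed instead of executed.
# Export PCS=pcs to run the commands against a real cluster.
PCS="${PCS:-echo pcs}"

# Show the recorded fail count for a resource.
show_failcount() {
    # $1 = resource name
    $PCS resource failcount show "$1"
}

# Clear the recorded failures so the node becomes eligible again.
clear_failures() {
    # $1 = resource name
    $PCS resource cleanup "$1"
}

# "my_service" is a placeholder resource name.
show_failcount my_service
clear_failures my_service
```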
-- 
Ken Gaillot <kgaillot at redhat.com>