[ClusterLabs Developers] Failing operations immediately when node is known to be down

Wed Apr 18 14:55:52 EDT 2018

On Tue, 2018-04-17 at 15:42 -0500, Ryan Thomas wrote:
> I’m trying to implement a HA solution which recovers very quickly
> when a node fails.  In my configuration, when I reboot a node, I see
> in the logs that pacemaker realizes the node is down, and decides to
> move all resources to the surviving node.  To do this, it initiates a
> ‘stop’ operation on each of the resources to perform the move.  The
> ‘stop’ fails as expected after 20s (the default action timeout). 
> However, in this case, with the node known to be down, I’d like to
> avoid this 20 second delay.  The node is known to be down, so any
> operations sent to the node will fail.  It would be nice if
> operations sent to a down node would immediately fail, thus reducing
> the time it takes the resource to be started on the surviving node. 
> I do not want to reduce the timeout for the operation, because the
> timeout is sensible for when a resource moves due to a non-node-
> failure.  Is there a way to accomplish this? 

I don't know if you've subscribed to the lists, but I sent this reply:

https://lists.clusterlabs.org/pipermail/users/2018-April/014862.html

Pacemaker doesn't try to stop anything on a node that is known to be
down, so we would need more details to figure out what's going on in
the above situation.
-- 
Ken Gaillot <kgaillot at redhat.com>