[ClusterLabs] Failing operations immediately when node is known to be down
Ken Gaillot
kgaillot at redhat.com
Fri Apr 13 14:35:34 UTC 2018
On Tue, 2018-04-10 at 12:56 -0500, Ryan Thomas wrote:
> I’m trying to implement a HA solution which recovers very quickly
> when a node fails. In my configuration, when I reboot a node, I see
> in the logs that pacemaker realizes the node is down, and decides to
> move all resources to the surviving node. To do this, it initiates a
> ‘stop’ operation on each of the resources to perform the move. The
> ‘stop’ fails as expected after 20s (the default action timeout).
> However, in this case, with the node known to be down, I’d like to
> avoid this 20 second delay. The node is known to be down, so any
> operations sent to the node will fail. It would be nice if
> operations sent to a down node would immediately fail, thus reducing
> the time it takes the resource to be started on the surviving node.
> I do not want to reduce the timeout for the operation, because the
> timeout is sensible when a resource moves for a reason other than
> node failure. Is there a way to accomplish this?
>
> Thanks for your help.

How are you rebooting -- cleanly (normal shutdown) or simulating a
failure (e.g. power button)?

In a normal shutdown, pacemaker will move all resources off the node
before it shuts down. These operations shouldn't fail, because the node
isn't down yet.

When a node fails, corosync should detect this and notify pacemaker.
Pacemaker will not try to execute any operations on a failed node.
Instead, it will fence it.

What log messages do you see from corosync and pacemaker indicating
that the node is down? Do you have fencing configured and tested?
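
To make the difference between the two cases above concrete, here is a
toy Python model of the ordering described: a clean shutdown stops each
resource on the still-running node, while a failed node is fenced and
gets no operations at all. This is only a sketch, not pacemaker code.
The names plan_actions and NodeState, the node names, and the tuple
layout are all made up for illustration, and the "start on the survivor
after fencing" step is an assumption rather than something stated above.

```python
from enum import Enum


class NodeState(Enum):
    ONLINE = "online"
    SHUTTING_DOWN = "shutting-down"  # clean shutdown: node is still reachable
    FAILED = "failed"                # membership lost unexpectedly


def plan_actions(state, resources, node="node1", survivor="node2"):
    """Return an ordered list of (action, target, where) tuples.

    Toy model only; names and tuple layout are illustrative assumptions.
    """
    if state is NodeState.ONLINE:
        return []  # nothing to recover

    if state is NodeState.SHUTTING_DOWN:
        # The node is still up, so each stop runs on it and can succeed
        # before the resource is started elsewhere.
        plan = []
        for rsc in resources:
            plan.append(("stop", rsc, node))
            plan.append(("start", rsc, survivor))
        return plan

    # FAILED: no operations are sent to the node; it is fenced instead.
    # Starting the resources on the survivor afterwards is an assumption
    # of this sketch.
    plan = [("fence", node, survivor)]
    plan.extend(("start", rsc, survivor) for rsc in resources)
    return plan


if __name__ == "__main__":
    for state in NodeState:
        print(state.value, plan_actions(state, ["vip", "db"]))
```

In this model a 'stop' never appears in the plan for a failed node, so a
stop that times out on the rebooted node would suggest the cluster did
not treat it as failed at that point.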
--
Ken Gaillot <kgaillot at redhat.com>