[ClusterLabs] Timeout before fencing/stonith happens
kgaillot at redhat.com
Wed Sep 2 10:34:04 EDT 2015
On 09/02/2015 01:14 AM, Nicolas S. wrote:
> I'm writing to this mailing list because I'm having a little trouble with my cluster.
> I'm running a 3-node CentOS 7 cluster. Resources, stonith and everything else are configured.
> For the past couple of days my backups have been taking a bit longer, and one node is getting a high load. At a certain point it gets fenced by the other nodes.
> Of course I'm thinking of correcting that backup issue but for the moment it's not done.
> I tried to find in the docs a general property to make the nodes wait a little longer before fencing the node, but I didn't really find it (or I misread the docs). It seems that stonith-timeout isn't the solution.
> Is there a best practice or such a property?
> Thanks to all.
Check your logs to determine what is causing the fencing, then increase
the timeout for that. The logs on the node that is DC at the time will
be most useful (if you're investigating after the fact, just look for
the node with the most verbose logs around that time).
As Ulrich suggested, most likely a monitor operation for one or more
resources is timing out, and increasing the monitor timeouts will help.
Of course, that may also increase the time needed to detect a real
failure, so set it back after fixing the underlying issue.
But it could be something else such as corosync, so the logs are the key.
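If the logs do point to a monitor timeout, the change is made on the monitor operation of the affected resource. A minimal sketch of what that could look like in the CIB, assuming a hypothetical resource "my-db" and placeholder interval/timeout values (none of these come from your configuration):

```xml
<!-- Hypothetical resource; the id, agent and values are placeholders -->
<primitive id="my-db" class="ocf" provider="heartbeat" type="pgsql">
  <operations>
    <!-- timeout raised so a slow monitor under backup load is not
         treated as a failure; revert once the backup issue is fixed -->
    <op id="my-db-monitor-30s" name="monitor" interval="30s" timeout="120s"/>
  </operations>
</primitive>
```

Only the timeout attribute on the monitor op matters here; pick a value based on how long the monitor actually takes under load, as seen in the logs.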
stonith-timeout controls how long the fencing operation itself is allowed
to take. By the time it starts ticking, fencing has already been
initiated, which is why it doesn't help you here.
If you want to get really fancy, you could use rules to automatically
move some or all resources off the node during backup times:
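As a rough illustration of that idea, here is a sketch of a location constraint with a time-based rule. The resource "my-db", the node name "node1" and the 01:00-03:59 window are all assumptions made up for the example, so adapt them to your actual backup schedule:

```xml
<!-- Hypothetical names and backup window; adjust to your cluster -->
<rsc_location id="backup-window-avoid-node1" rsc="my-db">
  <!-- Both conditions must hold for the -INFINITY score to apply -->
  <rule id="backup-window-rule" score="-INFINITY" boolean-op="and">
    <!-- Applies only to the node that runs the backup -->
    <expression id="backup-window-node" attribute="#uname"
                operation="eq" value="node1"/>
    <!-- True while the current hour is 1, 2 or 3 -->
    <date_expression id="backup-window-hours" operation="date_spec">
      <date_spec id="backup-window-spec" hours="1-3"/>
    </date_expression>
  </rule>
</rsc_location>
```

Keep in mind that time-based rules are only re-evaluated when the cluster rechecks its state, so the cluster-recheck-interval property bounds how promptly the move happens at the start and end of the window.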