[ClusterLabs] Pacemaker not restarting Resource on same node

Ken Gaillot kgaillot at redhat.com
Thu Jun 28 17:37:45 UTC 2018


On Thu, 2018-06-28 at 19:58 +0300, Andrei Borzenkov wrote:
> 28.06.2018 18:35, Dileep V Nair wrote:
> > 
> > 
> > Hi,
> > 
> > 	I have a cluster with DB2 running in HADR mode, using the db2
> > resource agent. My problem is that whenever DB2 fails on the
> > primary, it migrates to the secondary node. Ideally it should
> > restart three times (migration-threshold is set to 3), but that is
> > not happening. This is causing extra downtime for the customer. Are
> > there any other settings/parameters which need to be set? Did
> > anyone face a similar issue? I am on pacemaker version 1.1.15-21.1.
> > 
> 
> It is impossible to answer without good knowledge of the application
> and the resource agent. From a quick look at the resource agent, it
> removes the master score from the current node if a database failure
> is detected, which means the current node will not be eligible for
> fail-over.
> 
> Note that pacemaker does not really have a concept of "restarting a
> resource on the same node". Every time, it performs a full node
> selection using the current scores. The result usually happens to be
> the "same node" simply due to the non-zero default resource
> stickiness. You could attempt to adjust stickiness so that the final
> score is larger than the master score on the standby. But that also
> needs the agent's cooperation: are you sure the agent will even
> attempt to restart a failed master locally?
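
In case it helps, here is a rough, untested sketch (assuming pcs, and a
hypothetical master/slave resource named db2_hadr; adjust the name to
your configuration) of how you could raise stickiness and inspect the
fail counts and allocation scores the cluster is actually working with:

    # Raise the default stickiness so the current master's total score
    # can stay above the standby's master score after a failure:
    pcs resource defaults resource-stickiness=100

    # Make sure migration-threshold is really set on the resource:
    pcs resource meta db2_hadr migration-threshold=3

    # Check fail counts and allocation scores:
    crm_mon --failcounts
    crm_simulate -sL | grep -i db2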

Also, some types of errors cannot be recovered by a restart on the same
node.

For example, by default, start failures will not be retried on the same
node (see the cluster property start-failure-is-fatal), so that a
repeatedly failing start does not keep the cluster from doing anything
else. Certain OCF resource agent exit codes are also considered "hard"
errors that prevent retrying on the same node: missing dependencies,
file permission errors, etc.
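
If you do want start failures to be retried in place as well, something
along these lines (again an untested sketch, using pcs and the same
hypothetical resource name; the crm shell has equivalents) would make
start failures count against migration-threshold instead of immediately
banning the resource from that node, and would clear out old fail
counts so the counting starts fresh:

    # Let start failures count toward migration-threshold rather than
    # immediately moving the resource away:
    pcs property set start-failure-is-fatal=false

    # Clear accumulated fail counts for the (hypothetical) resource:
    pcs resource cleanup db2_hadr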
-- 
Ken Gaillot <kgaillot at redhat.com>

