[ClusterLabs] Debugging problems with resource timeout without any actions from cluster

Thu Oct 12 14:47:19 UTC 2017

On Thu, 2017-10-12 at 17:13 +0600, Sergey Korobitsin wrote:
> Hello,
> I'm seeing a strange problem with the MySQL resource agent from
> Percona: sometimes its monitor operation is killed by lrmd due to a
> timeout, like this:
> 
> Oct 12 12:26:46 sde1 lrmd[14812]:  warning: p_mysql_monitor_5000 process (PID 28991) timed out
> Oct 12 12:27:15 sde1 lrmd[14812]:  warning: p_mysql_monitor_5000:28991 - timed out after 20000ms
> Oct 12 12:27:15 sde1 crmd[14815]:    error: Result of monitor operation for p_mysql on sde1: Timed Out
> 
> I'm investigating the problem, but the trouble is that no unusual DB
> load or anything similar was detected. When these timeouts happen,
> Pacemaker tries to move MySQL (and all resources colocated with it)
> to the other node (I have a two-node cluster). For various reasons the
> other node is currently in standby mode, so Pacemaker moves the
> resources back, restarting them. All this moving and restarting makes
> our services unavailable for some time, which is unwanted.
> 
> So my goal is to keep the cluster up with MySQL and the other
> colocated resources, but with resource monitoring only, and without
> starting, stopping, promoting, or demoting resources, etc.
> 
> I found several ways to achieve that:
> 
> 1. Put the cluster in maintenance mode (as described here:
>    https://www.hastexo.com/resources/hints-and-kinks/maintenance-active-pacemaker-clusters/)
> 
>    As far as I understand, services will be monitored, all logs
>    written, etc., but no action will be taken in case of failures.
>    Is that right?

Actually, maintenance mode stops all monitors (except those with
role=Stopped, which ensure a service is not running).

> 
> 2. Put the particular resource in unmanaged mode, as described here:
>    http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/#s-monitoring-unmanaged

Disabling starts and stops is the exact purpose of unmanaged, so this
is one way to get what you want. FYI you can also set this as a global
default for all resources by setting it in the resource defaults
section of the configuration.
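
As a rough sketch in crm shell syntax (illustrative only, not tested
against your cluster), the per-resource form is a meta attribute on the
resource definition, and the global form is a single resource default:

```
# Per resource: append this to the definition of the resource
# you want unmanaged:
#   meta is-managed=false
#
# Global default for all resources (resource defaults section):
rsc_defaults is-managed=false
```

Remove the attribute, or set it back to true, once you are done
debugging, so the cluster can start and stop resources again.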

> 3. Start all resources and remove the start and stop operations
>    from them.

:-O

> Which is the best way to achieve this? I would like the cluster to run
> as usual (with the usual logging, or with tracing on the problematic
> resource), but no action should be taken when a monitor fails.

That's actually a different goal, also easily accomplished, by setting
on-fail=ignore on the monitor operation. From the sound of it, this is
closer to what you want, since the cluster is still allowed to
start/stop resources when you standby a node, etc.

You could also delete the recurring monitor operation from the
configuration, and it wouldn't run at all. But keeping it and setting
on-fail=ignore lets you see failures in cluster status.
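
For illustration, applied to the two monitor operations from your
configuration, that could look roughly like this in crm shell syntax
(a sketch; only on-fail=ignore is new, everything else is copied from
your primitive):

```
        op monitor interval=5s role=Master on-fail=ignore OCF_CHECK_LEVEL=1 \
        op monitor interval=2s role=Slave on-fail=ignore OCF_CHECK_LEVEL=1
```

With this in place, a failed or timed-out monitor is still recorded
and shown in cluster status, but it does not trigger recovery.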

However, I'm not sure bypassing the monitor is the best solution to
this problem. If the problem is simply that your database monitor can
legitimately take longer than 20 seconds in normal operation, then
raise the timeout as needed.
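
Your monitor operations don't set a timeout, so they get the default
of 20 seconds, which matches the "timed out after 20000ms" in your
logs. A sketch of setting it explicitly, in crm shell syntax (60s is
only a placeholder value; choose one that fits how long the check can
legitimately take on your system):

```
        op monitor interval=5s role=Master timeout=60s OCF_CHECK_LEVEL=1 \
        op monitor interval=2s role=Slave timeout=60s OCF_CHECK_LEVEL=1
```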

> Here is the configuration of the MySQL resource:
> 
> primitive p_mysql ocf:percona:mysql \
>         params config="/etc/mysql/my.cnf" \
>         pid="/var/run/mysqld/mysqld.pid" socket="/var/run/mysqld/mysqld.sock" \
>         replication_user=slave_user replication_passwd=password \
>         max_slave_lag=180 evict_outdated_slaves=false \
>         binary="/usr/sbin/mysqld" test_user=test test_passwd=test \
>         op start interval=0 timeout=60s \
>         op stop interval=0 timeout=60s \
>         op monitor interval=5s role=Master OCF_CHECK_LEVEL=1 \
>         op monitor interval=2s role=Slave OCF_CHECK_LEVEL=1
> 
-- 
Ken Gaillot <kgaillot at redhat.com>