[Pacemaker] mysql ocf resource agent - resource stays unmanaged if binary unavailable

Vladimir ml at foomx.de
Thu May 16 18:24:38 EDT 2013


Hi,

our pacemaker setup provides a MySQL resource via the OCF resource
agent. Today my colleagues and I tested forcing the MySQL resource to
fail, and I don't understand the resulting behaviour. When I remove the
mysqld_safe binary (whose path is specified in the crm config) from one
server and move the mysql resource to that server, the resource does
not fail back and stays in "unmanaged" status. We can see that the
function check_binary() is called within the mysql OCF resource agent
and exits with return code 5 (OCF_ERR_INSTALLED, "not installed"). The
fail-count is raised to INFINITY, pacemaker then tries to "stop" the
resource, and that stop fails as well. This results in an "unmanaged"
status.

How to reproduce:

1. mysql resource is running on node1
2. on node2 mv /usr/bin/mysqld_safe{,.bak}
3. crm resource move group-MySQL node2
4. observe corosync.log and crm_mon
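
For reference, the resource and its group are configured roughly like
this (trimmed; the values shown here are illustrative rather than a
verbatim copy of our CIB):

    primitive res-MySQL ocf:heartbeat:mysql \
        params binary="/usr/bin/mysqld_safe" \
        op start timeout="120s" interval="0" \
        op stop timeout="120s" interval="0" \
        op monitor interval="30s" timeout="30s"
    group group-MySQL res-MySQL-IP1 res-MySQL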

# cat /var/log/corosync/corosync.log
[...]
May 16 10:53:41 node2 lrmd: [1893]: info: operation start[119] on res-MySQL-IP1 for client 1896: pid 5137 exited with return code 0
May 16 10:53:41 node2 crmd: [1896]: info: process_lrm_event: LRM operation res-MySQL-IP1_start_0 (call=119, rc=0, cib-update=98, confirmed=true) ok
May 16 10:53:41 node2 crmd: [1896]: info: do_lrm_rsc_op: Performing key=94:102:0:28dea763-d2a2-4b9d-b86a-5357760ed16e op=res-MySQL-IP1_monitor_30000 )
May 16 10:53:41 node2 lrmd: [1893]: info: rsc:res-MySQL-IP1 monitor[120] (pid 5222)
May 16 10:53:41 node2 crmd: [1896]: info: do_lrm_rsc_op: Performing key=96:102:0:28dea763-d2a2-4b9d-b86a-5357760ed16e op=res-MySQL_start_0 )
May 16 10:53:41 node2 lrmd: [1893]: info: rsc:res-MySQL start[121] (pid 5223)
May 16 10:53:41 node2 lrmd: [1893]: info: RA output: (res-MySQL:start:stderr) 2013/05/16_10:53:41 ERROR: Setup problem: couldn't find command: /usr/bin/mysqld_safe
May 16 10:53:41 node2 lrmd: [1893]: info: operation start[121] on res-MySQL for client 1896: pid 5223 exited with return code 5
May 16 10:53:41 node2 crmd: [1896]: info: process_lrm_event: LRM operation res-MySQL_start_0 (call=121, rc=5, cib-update=99, confirmed=true) not installed
May 16 10:53:41 node2 lrmd: [1893]: info: operation monitor[120] on res-MySQL-IP1 for client 1896: pid 5222 exited with return code 0
May 16 10:53:41 node2 crmd: [1896]: info: process_lrm_event: LRM operation res-MySQL-IP1_monitor_30000 (call=120, rc=0, cib-update=100, confirmed=false) ok
May 16 10:53:41 node2 attrd: [1894]: notice: attrd_ais_dispatch: Update relayed from node1
May 16 10:53:41 node2 attrd: [1894]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-res-MySQL (INFINITY)
May 16 10:53:41 node2 attrd: [1894]: notice: attrd_perform_update: Sent update 44: fail-count-res-MySQL=INFINITY
May 16 10:53:41 node2 attrd: [1894]: notice: attrd_ais_dispatch: Update relayed from node1
May 16 10:53:41 node2 attrd: [1894]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-res-MySQL (1368694421)
May 16 10:53:41 node2 attrd: [1894]: notice: attrd_perform_update: Sent update 47: last-failure-res-MySQL=1368694421
May 16 10:53:41 node2 lrmd: [1893]: info: cancel_op: operation monitor[117] on res-DRBD-MySQL:1 for client 1896, its parameters: drbd_resource=[mysql] CRM_meta_role=[Master] CRM_meta_timeout=[20000] CRM_meta_name=[monitor] crm_feature_set=[3.0.5] CRM_meta_notify=[true] CRM_meta_clone_node_max=[1] CRM_meta_clone=[1] CRM_meta_clone_max=[2] CRM_meta_master_node_max=[1] CRM_meta_interval=[29000] CRM_meta_globally_unique=[false] CRM_meta_master_max=[1] cancelled
May 16 10:53:41 node2 crmd: [1896]: info: send_direct_ack: ACK'ing resource op res-DRBD-MySQL:1_monitor_29000 from 3:104:0:28dea763-d2a2-4b9d-b86a-5357760ed16e: lrm_invoke-lrmd-1368694421-57
May 16 10:53:41 node2 crmd: [1896]: info: do_lrm_rsc_op: Performing key=8:104:0:28dea763-d2a2-4b9d-b86a-5357760ed16e op=res-MySQL_stop_0 )
May 16 10:53:41 node2 lrmd: [1893]: info: rsc:res-MySQL stop[122] (pid 5278)
[...]

I cannot figure out why the fail-count is raised to INFINITY, and
especially why pacemaker tries to stop the resource after the failed
start. Shouldn't the best outcome for the resource be to fail back to
another node instead of ending up in an "unmanaged" status on this one?
Is it possible to force that behaviour in any way?
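
In case it is relevant: during testing we get out of the unmanaged
state by restoring the binary and then cleaning the resource up by
hand, roughly like this (crm shell syntax, node names as in the steps
above):

    # on node2: put the binary back
    mv /usr/bin/mysqld_safe{.bak,}
    # clear the recorded failure so pacemaker re-probes and re-manages the resource
    crm resource cleanup res-MySQL
    # alternatively, reset the fail count explicitly
    crm resource failcount res-MySQL delete node2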

Here are some specs of the software used on our cluster nodes:

node1:~# lsb_release -d && dpkg -l pacemaker | awk '/ii/{print $2,$3}' && uname -ri
Description:    Ubuntu 12.04.2 LTS
pacemaker 1.1.6-2ubuntu3
3.2.0-41-generic x86_64

Best regards
Vladimir



