[Pacemaker] mysql ocf resource agent - resource stays unmanaged if binary unavailable

Fri May 17 23:51:32 EDT 2013

On 18/05/2013, at 6:49 AM, Andreas Kurz <andreas at hastexo.com> wrote:

> On 2013-05-17 00:24, Vladimir wrote:
>> Hi,
>> 
>> our Pacemaker setup provides a mysql resource using the ocf resource agent.
>> Today my colleagues and I tested forcing the mysql resource to fail. I
>> don't understand the following behaviour. When I remove the mysqld_safe
>> binary (whose path is specified in the crm config) from one server and
>> move the mysql resource to this server, the resource does not fail
>> back and stays in the "unmanaged" status. We can see that the function
>> check_binary() is called within the mysql ocf resource agent and
>> exits with error code "5". The fail-count gets raised to INFINITY,
>> Pacemaker tries to "stop" the resource, and that stop fails as well. This
>> results in an "unmanaged" status.
>> 
>> How to reproduce:
>> 
>> 1. mysql resource is running on node1
>> 2. on node2 mv /usr/bin/mysqld_safe{,.bak}
>> 3. crm resource move group-MySQL node2
>> 4. observe corosync.log and crm_mon
>> 
>> # cat /var/log/corosync/corosync.log
>> [...]
>> May 16 10:53:41 node2 lrmd: [1893]: info: operation start[119] on res-MySQL-IP1 for client 1896: pid 5137 exited with return code 0
>> May 16 10:53:41 node2 crmd: [1896]: info: process_lrm_event: LRM operation res-MySQL-IP1_start_0 (call=119, rc=0, cib-update=98, confirmed=true) ok
>> May 16 10:53:41 node2 crmd: [1896]: info: do_lrm_rsc_op: Performing key=94:102:0:28dea763-d2a2-4b9d-b86a-5357760ed16e op=res-MySQL-IP1_monitor_30000 )
>> May 16 10:53:41 node2 lrmd: [1893]: info: rsc:res-MySQL-IP1 monitor[120] (pid 5222)
>> May 16 10:53:41 node2 crmd: [1896]: info: do_lrm_rsc_op: Performing key=96:102:0:28dea763-d2a2-4b9d-b86a-5357760ed16e op=res-MySQL_start_0 )
>> May 16 10:53:41 node2 lrmd: [1893]: info: rsc:res-MySQL start[121] (pid 5223)
>> May 16 10:53:41 node2 lrmd: [1893]: info: RA output: (res-MySQL:start:stderr) 2013/05/16_10:53:41 ERROR: Setup problem: couldn't find command: /usr/bin/mysqld_safe
>> 
>> May 16 10:53:41 node2 lrmd: [1893]: info: operation start[121] on res-MySQL for client 1896: pid 5223 exited with return code 5
>> May 16 10:53:41 node2 crmd: [1896]: info: process_lrm_event: LRM operation res-MySQL_start_0 (call=121, rc=5, cib-update=99, confirmed=true) not installed
>> May 16 10:53:41 node2 lrmd: [1893]: info: operation monitor[120] on res-MySQL-IP1 for client 1896: pid 5222 exited with return code 0
>> May 16 10:53:41 node2 crmd: [1896]: info: process_lrm_event: LRM operation res-MySQL-IP1_monitor_30000 (call=120, rc=0, cib-update=100, confirmed=false) ok
>> May 16 10:53:41 node2 attrd: [1894]: notice: attrd_ais_dispatch: Update relayed from node1
>> May 16 10:53:41 node2 attrd: [1894]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-res-MySQL (INFINITY)
>> May 16 10:53:41 node2 attrd: [1894]: notice: attrd_perform_update: Sent update 44: fail-count-res-MySQL=INFINITY
>> May 16 10:53:41 node2 attrd: [1894]: notice: attrd_ais_dispatch: Update relayed from node1
>> May 16 10:53:41 node2 attrd: [1894]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-res-MySQL (1368694421)
>> May 16 10:53:41 node2 attrd: [1894]: notice: attrd_perform_update: Sent update 47: last-failure-res-MySQL=1368694421
>> May 16 10:53:41 node2 lrmd: [1893]: info: cancel_op: operation monitor[117] on res-DRBD-MySQL:1 for client 1896, its parameters: drbd_resource=[mysql] CRM_meta_role=[Master] CRM_meta_timeout=[20000] CRM_meta_name=[monitor] crm_feature_set=[3.0.5] CRM_meta_notify=[true] CRM_meta_clone_node_max=[1] CRM_meta_clone=[1] CRM_meta_clone_max=[2] CRM_meta_master_node_max=[1] CRM_meta_interval=[29000] CRM_meta_globally_unique=[false] CRM_meta_master_max=[1]  cancelled
>> May 16 10:53:41 node2 crmd: [1896]: info: send_direct_ack: ACK'ing resource op res-DRBD-MySQL:1_monitor_29000 from 3:104:0:28dea763-d2a2-4b9d-b86a-5357760ed16e: lrm_invoke-lrmd-1368694421-57
>> May 16 10:53:41 node2 crmd: [1896]: info: do_lrm_rsc_op: Performing key=8:104:0:28dea763-d2a2-4b9d-b86a-5357760ed16e op=res-MySQL_stop_0 )
>> May 16 10:53:41 node2 lrmd: [1893]: info: rsc:res-MySQL stop[122] (pid 5278)
>> [...]
>> 
>> I cannot figure out why the fail-count gets raised to INFINITY and
>> especially why Pacemaker tries to stop the resource after the failure.
>> Wouldn't it be best for the resource to fail back to another node
>> instead of ending up in an "unmanaged" status on the node? Is it
>> possible to force this behaviour in any way?
> 
> By default, start failures are fatal: raising the fail-count to
> INFINITY disallows future starts on this node until the resource, and
> with it its fail-count, is cleaned up.
> 
> On a start failure Pacemaker tries to stop the resource to be sure it is
> really not started or stuck somewhere in between. The stop also fails in
> your case, so the cluster gets stuck and puts the resource into unmanaged mode.
> 
> Why? Because you obviously have no stonith configured that could make
> sure the resource is really stopped by fencing that node.
> 
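As a rough illustration of the sequence Andreas describes (a toy model, not
Pacemaker's actual code), the decision can be written down in a few lines of
shell. The function name outcome and its three arguments are made up for this
sketch:

```shell
#!/bin/sh
# Toy model of the behaviour described in this thread. NOT Pacemaker code.
# Usage: outcome <start_rc> <stop_rc> <stonith_enabled: true|false>
outcome() {
    start_rc=$1
    stop_rc=$2
    stonith=$3

    # Start succeeded: nothing to recover.
    if [ "$start_rc" -eq 0 ]; then
        echo "running"
        return 0
    fi

    # Start failed: by default this is fatal for the node (fail-count is
    # set to INFINITY) and a stop is issued to make sure nothing is left
    # half-started.
    if [ "$stop_rc" -eq 0 ]; then
        echo "recover on another node"
        return 0
    fi

    # Stop failed too: only fencing can guarantee the resource is down.
    if [ "$stonith" = "true" ]; then
        echo "fence node, then recover"
    else
        echo "unmanaged"
    fi
}
```

With the return codes from the log above, `outcome 5 5 false` prints
"unmanaged", which matches what Vladimir observed.
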
> Solution for your problem: correctly configure stonith and enable it in
> your cluster

Well, that and fix the mysql problem(s) that caused the original error(s).
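
For reference, a sketch of the cleanup once the binary is back in place and a
stonith device has been configured. The resource names come from Vladimir's
mail; the run and recover_mysql helpers and the DRY_RUN variable are only
illustrative, and the stonith device definition depends on your hardware and
is not shown:

```shell
#!/bin/sh
# Sketch only. Restore the binary on node2 first, e.g. by reversing step 2
# of the reproduction (mv /usr/bin/mysqld_safe.bak /usr/bin/mysqld_safe).

# Print the command instead of executing it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

recover_mysql() {
    # Enable fencing cluster-wide (assumes a stonith resource already exists).
    run crm configure property stonith-enabled=true
    # Clear the failed operations and the INFINITY fail-count.
    run crm resource cleanup res-MySQL
    # Remove the location constraint left behind by "crm resource move".
    run crm resource unmove group-MySQL
}
```

Setting DRY_RUN=1 before calling recover_mysql prints the three crm commands
without touching the cluster, which is a cheap way to review them first.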

> 
> Best regards,
> Andreas
> 
> -- 
> Need help with Pacemaker?
> http://www.hastexo.com/now
> 
>> 
>> Here some specs of the software used on our cluster nodes:
>> 
>> node1:~# lsb_release -d && dpkg -l pacemaker | awk '/ii/{print $2,$3}' && uname -ri
>> Description:    Ubuntu 12.04.2 LTS
>> pacemaker 1.1.6-2ubuntu3
>> 3.2.0-41-generic x86_64
>> 
>> Best regards
>> Vladimir
>> 
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> 
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>> 