[Pacemaker] mysql ocf resource agent - resource stays unmanaged if binary unavailable

Andreas Kurz andreas at hastexo.com
Fri May 17 16:49:06 EDT 2013


On 2013-05-17 00:24, Vladimir wrote:
> Hi,
> 
> our Pacemaker setup provides a mysql resource using the ocf resource
> agent. Today my colleagues and I tested forcing the mysql resource to
> fail. I don't understand the following behaviour. When I remove the
> mysqld_safe binary (whose path is specified in the crm config) from one
> server and move the mysql resource to this server, the resource does
> not fail back and stays in the "unmanaged" status. We can see that the
> function check_binary() is called within the mysql ocf resource agent
> and exits with error code "5". The fail-count gets raised to INFINITY,
> Pacemaker tries to "stop" the resource, and the stop fails. This
> results in an "unmanaged" status.
> 
> How to reproduce:
> 
> 1. mysql resource is running on node1
> 2. on node2 mv /usr/bin/mysqld_safe{,.bak}
> 3. crm resource move group-MySQL node2
> 4. observe corosync.log and crm_mon
> 
> # cat /var/log/corosync/corosync.log
> [...]
> May 16 10:53:41 node2 lrmd: [1893]: info: operation start[119] on res-MySQL-IP1 for client 1896: pid 5137 exited with return code 0
> May 16 10:53:41 node2 crmd: [1896]: info: process_lrm_event: LRM operation res-MySQL-IP1_start_0 (call=119, rc=0, cib-update=98, confirmed=true) ok
> May 16 10:53:41 node2 crmd: [1896]: info: do_lrm_rsc_op: Performing key=94:102:0:28dea763-d2a2-4b9d-b86a-5357760ed16e op=res-MySQL-IP1_monitor_30000 )
> May 16 10:53:41 node2 lrmd: [1893]: info: rsc:res-MySQL-IP1 monitor[120] (pid 5222)
> May 16 10:53:41 node2 crmd: [1896]: info: do_lrm_rsc_op: Performing key=96:102:0:28dea763-d2a2-4b9d-b86a-5357760ed16e op=res-MySQL_start_0 )
> May 16 10:53:41 node2 lrmd: [1893]: info: rsc:res-MySQL start[121] (pid 5223)
> May 16 10:53:41 node2 lrmd: [1893]: info: RA output: (res-MySQL:start:stderr) 2013/05/16_10:53:41 ERROR: Setup problem: couldn't find command: /usr/bin/mysqld_safe
> 
> May 16 10:53:41 node2 lrmd: [1893]: info: operation start[121] on res-MySQL for client 1896: pid 5223 exited with return code 5
> May 16 10:53:41 node2 crmd: [1896]: info: process_lrm_event: LRM operation res-MySQL_start_0 (call=121, rc=5, cib-update=99, confirmed=true) not installed
> May 16 10:53:41 node2 lrmd: [1893]: info: operation monitor[120] on res-MySQL-IP1 for client 1896: pid 5222 exited with return code 0
> May 16 10:53:41 node2 crmd: [1896]: info: process_lrm_event: LRM operation res-MySQL-IP1_monitor_30000 (call=120, rc=0, cib-update=100, confirmed=false) ok
> May 16 10:53:41 node2 attrd: [1894]: notice: attrd_ais_dispatch: Update relayed from node1
> May 16 10:53:41 node2 attrd: [1894]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-res-MySQL (INFINITY)
> May 16 10:53:41 node2 attrd: [1894]: notice: attrd_perform_update: Sent update 44: fail-count-res-MySQL=INFINITY
> May 16 10:53:41 node2 attrd: [1894]: notice: attrd_ais_dispatch: Update relayed from node1
> May 16 10:53:41 node2 attrd: [1894]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-res-MySQL (1368694421)
> May 16 10:53:41 node2 attrd: [1894]: notice: attrd_perform_update: Sent update 47: last-failure-res-MySQL=1368694421
> May 16 10:53:41 node2 lrmd: [1893]: info: cancel_op: operation monitor[117] on res-DRBD-MySQL:1 for client 1896, its parameters: drbd_resource=[mysql] CRM_meta_role=[Master] CRM_meta_timeout=[20000] CRM_meta_name=[monitor] crm_feature_set=[3.0.5] CRM_meta_notify=[true] CRM_meta_clone_node_max=[1] CRM_meta_clone=[1] CRM_meta_clone_max=[2] CRM_meta_master_node_max=[1] CRM_meta_interval=[29000] CRM_meta_globally_unique=[false] CRM_meta_master_max=[1]  cancelled
> May 16 10:53:41 node2 crmd: [1896]: info: send_direct_ack: ACK'ing resource op res-DRBD-MySQL:1_monitor_29000 from 3:104:0:28dea763-d2a2-4b9d-b86a-5357760ed16e: lrm_invoke-lrmd-1368694421-57
> May 16 10:53:41 node2 crmd: [1896]: info: do_lrm_rsc_op: Performing key=8:104:0:28dea763-d2a2-4b9d-b86a-5357760ed16e op=res-MySQL_stop_0 )
> May 16 10:53:41 node2 lrmd: [1893]: info: rsc:res-MySQL stop[122] (pid 5278)
> [...]
> 
> I cannot figure out why the fail-count gets raised to INFINITY and
> especially why Pacemaker tries to stop the resource after the failed
> start. Wouldn't it be best for the resource to fail back to another
> node instead of ending up in an "unmanaged" status on this node? Is it
> possible to force this behaviour in any way?

By default, start failures are fatal: the fail-count is raised to
INFINITY, which disallows any further start on this node until the
resource, and with it its fail-count, is cleaned up.
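
A minimal sketch of inspecting and clearing that fail-count with the crm
shell, assuming the resource and node names from the log (res-MySQL,
node2). The function names and the CRM variable are only illustrative;
CRM exists so the commands can be dry-run by setting it to "echo".

```shell
#!/bin/sh
# Sketch only: names come from the thread, the helpers are hypothetical.
# CRM can be overridden (e.g. CRM=echo) to print commands instead of running them.
CRM="${CRM:-crm}"

# Show the current fail-count of a resource on one node.
show_failcount() {
    rsc="$1"; node="$2"
    $CRM resource failcount "$rsc" show "$node"
}

# Clean up the resource on one node, which also resets its fail-count
# so that the cluster may try to start it there again.
clear_start_failure() {
    rsc="$1"; node="$2"
    $CRM resource cleanup "$rsc" "$node"
}
```

Usage would look like "show_failcount res-MySQL node2" followed by
"clear_start_failure res-MySQL node2". A cleanup only helps once the
cause is fixed, here by putting /usr/bin/mysqld_safe back on node2.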

On a start failure, Pacemaker tries to stop the resource to be sure it
is really not started or stuck somewhere in between. In your case the
stop fails as well, so the cluster gets stuck and puts the resource into
unmanaged mode.
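
To illustrate why the stop can fail in the same way as the start, here
is a simplified sketch. It assumes the agent runs its binary check
before dispatching any action, which is what the reported behaviour
suggests; it is not the actual code of the mysql resource agent, and
check_binary_sketch and agent_sketch are made-up names.

```shell
#!/bin/sh
# OCF return code for "not installed", as seen in the log (rc=5).
OCF_ERR_INSTALLED=5

# Simplified stand-in for a binary check: report "not installed"
# when the given command cannot be found or is not executable.
check_binary_sketch() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "ERROR: Setup problem: couldn't find command: $1" >&2
        return "$OCF_ERR_INSTALLED"
    fi
}

# Hypothetical agent entry point: the check runs first, so start,
# stop and monitor all return 5 while the binary is missing.
agent_sketch() {
    action="$1"; binary="$2"
    check_binary_sketch "$binary" || return $?
    case "$action" in
        start|stop|monitor) return 0 ;;
        *) return 3 ;;  # OCF_ERR_UNIMPLEMENTED
    esac
}
```

With such a structure, "agent_sketch stop /usr/bin/mysqld_safe" returns
5 on a node where the binary was moved away, so the cluster cannot
confirm that the resource is stopped.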

Why? Because you obviously have no stonith configured that could make
sure the resource is really stopped by fencing that node.

Solution for your problem: configure stonith correctly and enable it in
your cluster.
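
A rough sketch of what that could look like with the crm shell. The
fencing hardware is not described in this thread, so the IPMI device,
the address and the credentials are pure assumptions and placeholders;
use whichever stonith plugin matches your hardware. The helper names and
the CRM variable (set it to "echo" for a dry run) are illustrative only.

```shell
#!/bin/sh
# Sketch only: device type, address and credentials are placeholders.
CRM="${CRM:-crm}"

# Define one fencing device for a node and keep it off that node,
# so that a node is never responsible for fencing itself.
add_ipmi_fence_device() {
    node="$1"; bmc_ip="$2"; user="$3"; pass="$4"
    $CRM configure primitive "st-$node" stonith:external/ipmi \
        params hostname="$node" ipaddr="$bmc_ip" userid="$user" passwd="$pass" interface=lan \
        op monitor interval=60s
    $CRM configure location "l-st-$node" "st-$node" -inf: "$node"
}

# Turn fencing on cluster-wide once the devices are defined.
enable_stonith() {
    $CRM configure property stonith-enabled=true
}
```

For example: "add_ipmi_fence_device node2 192.0.2.12 admin secret" for
each node, then "enable_stonith". With working fencing, a failed stop
leads to the node being fenced and the resource being recovered
elsewhere instead of staying unmanaged.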

Best regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> Here are some specs of the software used on our cluster nodes:
> 
> node1:~# lsb_release -d && dpkg -l pacemaker | awk '/ii/{print $2,$3}' && uname -ri
> Description:    Ubuntu 12.04.2 LTS
> pacemaker 1.1.6-2ubuntu3
> 3.2.0-41-generic x86_64
> 
> Best regards
> Vladimir
> 
