[Pacemaker] Possible error in RA invocation

Thu Jan 30 14:50:41 EST 2014

Hi everyone,

I am running a two-node cluster which hosts two Xen VMs. We're using 
DRBD, but it's managed directly from Xen.

The configuration of one of these resources is as follows:

primitive xen-vm1 ocf:heartbeat:Xen
         params xmfile="/etc/xen/vm1.cfg"
         op monitor interval="30s"
         op start interval="0" timeout="60s"
         op stop interval="0" timeout="300s"
         op migrate_from interval="0" timeout="240"
         op migrate_to interval="0" timeout="240"
         meta allow-migrate="true" target-role="Started"

I have a problem with the monitor operation. It seems to be working 
fine... until it doesn't. The cluster can be running for weeks without 
any failure, but sometimes the monitor operation fails with a really 
strange error from the resource agent. This is an excerpt of one of the 
failures:

Jan 28 14:40:20 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 monitor[71] 
(pid 11756)
Jan 28 14:40:20 xenhost1 lrmd: [3822]: info: operation monitor[71] on 
xen-vm1 for client 3825: pid 11756 exited with return code 0
Jan 28 15:40:26 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 monitor[71] 
(pid 18065)
Jan 28 15:40:27 xenhost1 lrmd: [3822]: info: operation monitor[71] on 
xen-vm1 for client 3825: pid 18065 exited with return code 0
Jan 28 16:40:32 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 monitor[71] 
(pid 24373)
Jan 28 16:40:32 xenhost1 lrmd: [3822]: info: operation monitor[71] on 
xen-vm1 for client 3825: pid 24373 exited with return code 0
Jan 28 17:40:38 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 monitor[71] 
(pid 30686)
Jan 28 17:40:38 xenhost1 lrmd: [3822]: info: operation monitor[71] on 
xen-vm1 for client 3825: pid 30686 exited with return code 0
Jan 28 18:40:44 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 monitor[71] 
(pid 4593)
Jan 28 18:40:44 xenhost1 lrmd: [3822]: info: operation monitor[71] on 
xen-vm1 for client 3825: pid 4593 exited with return code 0
Jan 28 18:55:23 xenhost1 lrmd: [3822]: info: RA output: 
(xen-vm1:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/Xen: 71: local:
Jan 28 18:55:23 xenhost1 lrmd: [3822]: info: RA output: 
(xen-vm1:monitor:stderr) en-list: bad variable name
Jan 28 18:55:23 xenhost1 lrmd: [3822]: info: RA output: 
(xen-vm1:monitor:stderr)
Jan 28 18:55:23 xenhost1 lrmd: [3822]: info: cancel_op: operation 
monitor[71] on xen-vm1 for client 3825, its parameters: 
crm_feature_set=[3.0.6] xmfile=[/etc/xen/vm1.cfg] 
CRM_meta_name=[monitor] CRM_meta_interval=[30000] 
CRM_meta_timeout=[20000]  cancelled
Jan 28 18:55:23 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 stop[72] (pid 6219)
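
My only guess so far, and it is an assumption I have not confirmed in 
the RA source: the "local: ...: bad variable name" text looks like what 
some /bin/sh implementations (older dash, for example) print when `local` 
is given an unquoted command substitution whose output contains 
whitespace. A minimal sketch of that behaviour follows; `fake_list` and 
both `capture_*` functions are made-up names for illustration and are 
not taken from the Xen RA.

```shell
#!/bin/sh
# Hypothetical stand-in for a tool whose output contains whitespace.
fake_list() {
    echo "first second-word"
}

# Unquoted: a shell that word-splits the arguments of `local` sees
# `out=first` and then `second-word`, which is not a valid variable
# name. bash does not split here, so this form only fails under some
# shells. It is defined for comparison and deliberately not called.
capture_unquoted() {
    local out=$(fake_list)
    echo "$out"
}

# Quoted: the whole output stays a single value in either shell.
capture_quoted() {
    local out="$(fake_list)"
    echo "$out"
}

capture_quoted
```

If this guess is right, it would explain why the monitor only fails 
occasionally: the failure would depend on what the underlying command 
happens to print at that moment.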

The machines are very low on resources, so the unnecessary stop and 
migration that follows each of these failures is causing problems.

The systems are running Debian Wheezy with pacemaker 1.1.7-1 and 
resource-agents 3.9.2-5+deb7u1. I don't know yet if there's a problem 
with the Xen RA, the lrmd service itself or my configuration. I wasn't 
able to find any information related to this issue. Do you have any idea 
of what could be causing this? Any help will be appreciated.

Regards,
Santiago