[ClusterLabs] ocf-tester always claims failure, even with built-in resource agents?

Antony Stone Antony.Stone at ha.open.source.it
Fri Mar 26 10:28:24 EDT 2021


Hi.

I've just signed up to the list.  I've been using corosync and pacemaker for 
several years, mostly under Debian 9, which means:

	corosync 2.4.2
	pacemaker 1.1.16

I've recently upgraded a test cluster to Debian 10, which gives me:

	corosync 3.0.1
	pacemaker 2.0.1

I've made a few adjustments to my /etc/corosync/corosync.conf configuration so 
that corosync seems happy, and also some minor changes (mostly to the cluster 
defaults) in /etc/corosync/cluster.cib so that pacemaker is happy.

So far all is well and good, my cluster synchronises, starts the resources, 
and everything's working as expected.  It'll move the resources from one 
cluster member to another (either if I ask it to, or if there's a problem), 
and it seems to work just as the older version did.

Then, several times a day, I get resource failures such as:

	* Asterisk_start_0 on castor 'insufficient privileges' (4):
	 call=58,
	 status=complete,
	 exitreason='',
	 last-rc-change='Fri Mar 26 13:37:08 2021',
	 queued=0ms,
	 exec=55ms

I have no idea why the machine might tell me it cannot start Asterisk due to 
insufficient privileges when it was already able to run it before the cluster 
resources moved back to this machine.  Asterisk *can* and *does* run on this 
machine.

Another error I get is:

	* Kann-Bear_monitor_5000 on helen 'unknown error' (1):
	 call=62,
	 status=complete,
	 exitreason='',
	 last-rc-change='Fri Mar 26 14:23:05 2021',
	 queued=0ms,
	 exec=0ms

Now, that second resource is one which doesn't have a standard resource agent 
available for it under /usr/lib/ocf/resource.d, so I'm using the general-
purpose agent /usr/lib/ocf/resource.d/heartbeat/anything to manage it.

I thought, "perhaps there's something dodgy about using this 'anything' agent, 
because it can't really know about the resource it's managing", so I tested it 
with ocf-tester:

# ocf-tester -n Kann-Bear -o binfile="/usr/sbin/bearerbox" \
	-o cmdline_options="/etc/kannel/kannel.conf" \
	-o pidfile="/var/run/kannel/kannel_bearerbox.pid" \
	/usr/lib/ocf/resource.d/heartbeat/anything
Beginning tests for /usr/lib/ocf/resource.d/heartbeat/anything...
/usr/sbin/ocf-tester: 226: /usr/sbin/ocf-tester: xmllint: not found
* rc=127: Your agent produces meta-data which does not conform to ra-api-1.dtd
* Your agent does not support the notify action (optional)
* Your agent does not support the demote action (optional)
* Your agent does not support the promote action (optional)
* Your agent does not support master/slave (optional)
* Your agent does not support the reload action (optional)
Tests failed: /usr/lib/ocf/resource.d/heartbeat/anything failed 1 tests

Okay, something's not right.

BUT, it doesn't matter *which* resource agent I test, it tells me the same 
thing every time, including for the built-in standard agents:

* rc=127: Your agent produces meta-data which does not conform to ra-api-1.dtd

For example:

# ocf-tester -n Asterisk /usr/lib/ocf/resource.d/heartbeat/asterisk
Beginning tests for /usr/lib/ocf/resource.d/heartbeat/asterisk...
/usr/sbin/ocf-tester: 226: /usr/sbin/ocf-tester: xmllint: not found
* rc=127: Your agent produces meta-data which does not conform to ra-api-1.dtd
* Your agent does not support the notify action (optional)
* Your agent does not support the demote action (optional)
* Your agent does not support the promote action (optional)
* Your agent does not support master/slave (optional)
* Your agent does not support the reload action (optional)
Tests failed: /usr/lib/ocf/resource.d/heartbeat/asterisk failed 1 tests


# ocf-tester -n IP-Float4 -o ip=10.1.0.42 -o cidr_netmask=28 \
	/usr/lib/ocf/resource.d/heartbeat/IPaddr2
Beginning tests for /usr/lib/ocf/resource.d/heartbeat/IPaddr2...
/usr/sbin/ocf-tester: 226: /usr/sbin/ocf-tester: xmllint: not found
* rc=127: Your agent produces meta-data which does not conform to ra-api-1.dtd
* Your agent does not support the notify action (optional)
* Your agent does not support the demote action (optional)
* Your agent does not support the promote action (optional)
* Your agent does not support master/slave (optional)
* Your agent does not support the reload action (optional)
Tests failed: /usr/lib/ocf/resource.d/heartbeat/IPaddr2 failed 1 tests


So, it seems to be telling me that even the standard built-in resource agents 
"produce meta-data which does not conform to ra-api-1.dtd".


My first question is: what's going wrong here?  Am I using ocf-tester 
incorrectly, or is it a bug?
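
Going only by the output above, one thing worth ruling out first: every run 
prints "xmllint: not found" immediately before the rc=127 line, and 127 is the 
exit status a POSIX shell returns when a command cannot be found.  So the 
"does not conform" message may simply mean the validator never ran.  A minimal 
sketch of that check follows; check_tool is a hypothetical helper name, and 
the package hint assumes Debian, where xmllint is normally shipped in 
libxml2-utils.

```shell
#!/bin/sh
# check_tool (hypothetical helper): report whether a program is on PATH.
# Returns 127, the shell's own "command not found" status, when it is missing.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found"
        return 127
    fi
}

# Assumption: on Debian, xmllint comes from the libxml2-utils package.
check_tool xmllint || echo "xmllint missing; try: apt install libxml2-utils"
```

If xmllint turns out to be missing, installing it and re-running ocf-tester 
should show whether the meta-data complaint was ever about the agents 
themselves.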

My second question is: how can I debug what caused pacemaker to decide that it 
couldn't start Asterisk due to "insufficient privileges" on a machine which is 
perfectly capable of running Asterisk, including when it is started by 
pacemaker?  (In fact, that's the only way Asterisk gets started on these 
machines; it's a floating resource which pacemaker is in charge of.)
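
As a starting point for reading those failure lines: the number in 
parentheses is the exit code the resource agent returned, and pacemaker's text 
is its label for that code.  In the OCF resource agent API, 4 is OCF_ERR_PERM 
and 1 is OCF_ERR_GENERIC.  The sketch below maps the standard codes to their 
names; ocf_rc_name is a hypothetical helper, not part of pacemaker or 
resource-agents, and codes 8 and 9 use the master/slave naming current in 
pacemaker 2.0.

```shell
#!/bin/sh
# ocf_rc_name (hypothetical helper): translate an OCF resource agent
# exit code into its symbolic name as defined by the OCF RA API.
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;        # shown as 'unknown error'
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;           # shown as 'insufficient privileges'
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        8) echo OCF_RUNNING_MASTER ;;
        9) echo OCF_FAILED_MASTER ;;
        *) echo "unknown code: $1" ;;
    esac
}

ocf_rc_name 4   # the Asterisk_start_0 failure
ocf_rc_name 1   # the Kann-Bear_monitor_5000 failure
```

Since exitreason is empty in both reports, the agent's own code path that 
returns 4 on start is probably the place to look, for example by searching 
the agent script for OCF_ERR_PERM.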


Please let me know if I can provide any further information to help work out 
what's going on here.


Thanks,


Antony.

-- 
"Hi, I've found a fault with the English language and I need an entomologist."
"I think you mean an etymologist."
"No.  It's a bug, not a feature."

                                                   Please reply to the list;
                                                         please *don't* CC me.

