[ClusterLabs] ocf-tester always claims failure, even with built-in resource agents?

Ken Gaillot kgaillot at redhat.com
Fri Mar 26 13:31:51 EDT 2021


On Fri, 2021-03-26 at 19:59 +0300, Andrei Borzenkov wrote:
> On 26.03.2021 17:28, Antony Stone wrote:
> > Hi.
> > 
> > I've just signed up to the list.  I've been using corosync and
> > pacemaker for 
> > several years, mostly under Debian 9, which means:
> > 
> > 	corosync 2.4.2
> > 	pacemaker 1.1.16
> > 
> > I've recently upgraded a test cluster to Debian 10, which gives me:
> > 
> > 	corosync 3.0.1
> > 	pacemaker 2.0.1
> > 
> > I've made a few adjustments to my /etc/corosync/corosync.conf
> > configuration so 
> > that corosync seems happy, and also some minor changes (mostly to
> > the cluster 
> > defaults) in /etc/corosync/cluster.cib so that pacemaker is happy.
> > 
> > So far all is well and good, my cluster synchronises, starts the
> > resources, 
> > and everything's working as expected.  It'll move the resources
> > from one 
> > cluster member to another (either if I ask it to, or if there's a
> > problem), 
> > and it seems to work just as the older version did.

I'm glad it has been easy this far :)

> > Then, several times a day, I get resource failures such as:
> > 
> > 	* Asterisk_start_0 on castor 'insufficient privileges' (4):
> > 	 call=58,
> > 	 status=complete,
> > 	 exitreason='',
> > 	 last-rc-change='Fri Mar 26 13:37:08 2021',
> > 	 queued=0ms,
> > 	 exec=55ms
> > 
> > I have no idea why the machine might tell me it cannot start
> > Asterisk due to 
> > insufficient privilege when it's already been able to run it before
> > the cluster 
> > resources moved back to this machine.  Asterisk *can* and *does*
> > run on this 
> > machine.
> > 
> > Another error I get is:
> > 
> > 	* Kann-Bear_monitor_5000 on helen 'unknown error' (1):
> > 	 call=62,
> > 	 status=complete,
> > 	 exitreason='',
> > 	 last-rc-change='Fri Mar 26 14:23:05 2021',
> > 	 queued=0ms,
> > 	 exec=0ms
> > 
> > Now, that second resource is one which doesn't have a standard
> > resource agent 
> > available for it under /usr/lib/ocf/resource.d, so I'm using the
> > general-
> > purpose agent /usr/lib/ocf/resource.d/heartbeat/anything to manage
> > it.
> > 
> > I thought, "perhaps there's something dodgy about using this
> > 'anything' agent, 
> > because it can't really know about the resource it's managing", so
> > I tested it 
> > with ocf-tester:
> > 
> > # ocf-tester -n Kann-Bear -o binfile="/usr/sbin/bearerbox" -o 
> > cmdline_options="/etc/kannel/kannel.conf" -o 
> > pidfile="/var/run/kannel/kannel_bearerbox.pid" 
> > /usr/lib/ocf/resource.d/heartbeat/anything
> > Beginning tests for /usr/lib/ocf/resource.d/heartbeat/anything...
> > /usr/sbin/ocf-tester: 226: /usr/sbin/ocf-tester: xmllint: not found
> > * rc=127: Your agent produces meta-data which does not conform to
> > ra-api-1.dtd
> > * Your agent does not support the notify action (optional)
> > * Your agent does not support the demote action (optional)
> > * Your agent does not support the promote action (optional)
> > * Your agent does not support master/slave (optional)
> > * Your agent does not support the reload action (optional)
> > Tests failed: /usr/lib/ocf/resource.d/heartbeat/anything failed 1
> > tests
> > 
> > Okay, something's not right.
> > 
> > BUT, it doesn't matter *which* resource agent I test, it tells me
> > the same 
> > thing every time, including for the built-in standard agents:
> > 
> > * rc=127: Your agent produces meta-data which does not conform to
> > ra-api-1.dtd
> > 
> > For example:
> > 
> > # ocf-tester -n Asterisk /usr/lib/ocf/resource.d/heartbeat/asterisk
> > Beginning tests for /usr/lib/ocf/resource.d/heartbeat/asterisk...
> > /usr/sbin/ocf-tester: 226: /usr/sbin/ocf-tester: xmllint: not found
> > * rc=127: Your agent produces meta-data which does not conform to
> > ra-api-1.dtd
> > * Your agent does not support the notify action (optional)
> > * Your agent does not support the demote action (optional)
> > * Your agent does not support the promote action (optional)
> > * Your agent does not support master/slave (optional)
> > * Your agent does not support the reload action (optional)
> > Tests failed: /usr/lib/ocf/resource.d/heartbeat/asterisk failed 1
> > tests
> > 
> > 
> > # ocf-tester -n IP-Float4 -o ip=10.1.0.42 -o cidr_netmask=28 
> > /usr/lib/ocf/resource.d/heartbeat/IPaddr2
> > Beginning tests for /usr/lib/ocf/resource.d/heartbeat/IPaddr2...
> > /usr/sbin/ocf-tester: 226: /usr/sbin/ocf-tester: xmllint: not found
> > * rc=127: Your agent produces meta-data which does not conform to
> > ra-api-1.dtd
> > * Your agent does not support the notify action (optional)
> > * Your agent does not support the demote action (optional)
> > * Your agent does not support the promote action (optional)
> > * Your agent does not support master/slave (optional)
> > * Your agent does not support the reload action (optional)
> > Tests failed: /usr/lib/ocf/resource.d/heartbeat/IPaddr2 failed 1
> > tests
> > 
> > 
> > So, it seems to be telling me that even the standard built-in
> > resource agents 
> > "produce meta-data which does not conform to ra-api-1.dtd"
> > 
> > 
> > My first question is: what's going wrong here?  Am I using ocf-
> > tester 
> > incorrectly, or is it a bug?
> > 
> 
> As is pretty clear from the error messages, ocf-tester calls xmllint,
> which is missing.

It's worth noting that pacemaker itself doesn't try to validate the
agent meta-data; it just checks for the pieces that are interesting to
it and ignores the rest.

It's also worth noting that the OCF 1.0 standard is horribly outdated
compared to actual use, and the OCF 1.1 standard is being adopted today
(!) after many years of trying to come up with something more up-to-
date.

Bottom line, it's worth installing xmllint to see if that helps, but I
wouldn't worry about meta-data schema issues.
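
As a rough sketch of that check: the rc=127 in the output above is the
shell's "command not found" status, so it says nothing about the agent
itself. The snippet below only tests whether xmllint is on the PATH. The
Debian package name (libxml2-utils) and the helper names have_tool and
validate_metadata are my own assumptions for illustration, and the DTD
location is left to the caller because it varies between installs.

```shell
# Succeeds if the named program can be found on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Validate one agent's meta-data by hand, roughly what ocf-tester attempts.
# $1 = agent path, $2 = path to ra-api-1.dtd (both supplied by the caller).
validate_metadata() {
    OCF_ROOT=/usr/lib/ocf "$1" meta-data | xmllint --noout --dtdvalid "$2" -
}

if have_tool xmllint; then
    echo "xmllint found at $(command -v xmllint)"
else
    # Assumption: Debian ships xmllint in the libxml2-utils package.
    echo "xmllint not found; try: apt install libxml2-utils"
fi
```

Once xmllint is present, re-running the same ocf-tester commands should
show whether any real test failures remain.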

> > My second question is: how can I debug what caused pacemaker to
> > decide that it 
> > couldn't run Asterisk due to "insufficient privileges" on a machine
> > which is 
> > perfectly well capable of running Asterisk, and including when it
> > gets 
> > started by pacemaker (in fact, that's the only way Asterisk gets
> > started on 
> > these machines; it's a floating resource which pacemaker is in
> > charge of).
> > 
> 
> The agent returns this error if it fails to chown the directory
> specified in its configuration file:
> 
>         # Regardless of whether we just created the directory or it
>         # already existed, check whether it is writable by the
>         # configured user
>         if ! su -s /bin/sh - $OCF_RESKEY_user -c "test -w $dir"; then
>             ocf_log warn "Directory $dir is not writable by $OCF_RESKEY_user, attempting chown"
>             ocf_run chown $OCF_RESKEY_user:$OCF_RESKEY_group $dir \
>                 || exit $OCF_ERR_PERM
>         fi
> 
> Either the agent does not run as root or something blocks chown. The
> usual suspects are AppArmor or SELinux.

Pacemaker itself can also return this error in certain cases, such as
not having permissions to execute the agent. Check the pacemaker detail
log (usually /var/log/pacemaker/pacemaker.log) and the system log
around these times to see if there is more detail.
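
One way to pull the relevant lines out of the detail log is to search for
the operation key shown in the failure report (for example
Asterisk_start_0). This is only a sketch: find_op_events is a made-up
helper name, and the log path is the usual default mentioned above, which
may differ on your system.

```shell
# Print the log lines that mention a given operation key.
# Usage: find_op_events <logfile> <operation-key>
find_op_events() {
    logfile=$1
    op_key=$2
    if [ ! -r "$logfile" ]; then
        echo "cannot read $logfile" >&2
        return 1
    fi
    # -F treats the key as a fixed string rather than a regular expression.
    grep -F -- "$op_key" "$logfile"
}

# Example (run as a user allowed to read the log):
#   find_op_events /var/log/pacemaker/pacemaker.log Asterisk_start_0
```

Comparing the timestamps of the matches with the last-rc-change values in
the status output should narrow down which log entries belong to each
failure.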

It is definitely weird that a privileges error would be sporadic.
Hopefully the logs can shed some more light.

Another possibility would be to set trace_ra=1 on the failing actions to
get line-by-line info from the agents.
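
If you manage the cluster with crmsh, its "resource trace" subcommand is
one way to set this. The wrapper below is a hedged sketch, not something
taken from your configuration: enable_trace and the DRY_RUN variable are
hypothetical names, and it assumes the crm command is installed. By
default it only prints the command so you can review it first.

```shell
# Enable agent tracing for one operation of one resource via crmsh.
# With DRY_RUN=1 (the default here) the command is printed, not executed.
enable_trace() {
    rsc=$1
    op=$2
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "crm resource trace $rsc $op"
    else
        crm resource trace "$rsc" "$op"
    fi
}

enable_trace Asterisk start
```

Tracing is verbose, so it is worth turning it off again (crmsh has a
matching "resource untrace") once a failure has been captured.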

> > Please let me know if I can provide any further information to help
> > work out 
> > what's going on here.
> > 
> > 
> > Thanks,
> > 
> > 
> > Antony.
> > 
> 
-- 
Ken Gaillot <kgaillot at redhat.com>


