[ClusterLabs] ocf-tester always claims failure, even with built-in resource agents?
Andrei Borzenkov
arvidjaar at gmail.com
Fri Mar 26 12:59:07 EDT 2021
On 26.03.2021 17:28, Antony Stone wrote:
> Hi.
>
> I've just signed up to the list. I've been using corosync and pacemaker for
> several years, mostly under Debian 9, which means:
>
> corosync 2.4.2
> pacemaker 1.1.16
>
> I've recently upgraded a test cluster to Debian 10, which gives me:
>
> corosync 3.0.1
> pacemaker 2.0.1
>
> I've made a few adjustments to my /etc/corosync/corosync.conf configuration so
> that corosync seems happy, and also some minor changes (mostly to the cluster
> defaults) in /etc/corosync/cluster.cib so that pacemaker is happy.
>
> So far all is well and good, my cluster synchronises, starts the resources,
> and everything's working as expected. It'll move the resources from one
> cluster member to another (either if I ask it to, or if there's a problem),
> and it seems to work just as the older version did.
>
> Then, several times a day, I get resource failures such as:
>
> * Asterisk_start_0 on castor 'insufficient privileges' (4):
> call=58,
> status=complete,
> exitreason='',
> last-rc-change='Fri Mar 26 13:37:08 2021',
> queued=0ms,
> exec=55ms
>
> I have no idea why the machine might tell me it cannot start Asterisk due to
> insufficient privilege when it's already been able to run it before the cluster
> resources moved back to this machine. Asterisk *can* and *does* run on this
> machine.
>
> Another error I get is:
>
> * Kann-Bear_monitor_5000 on helen 'unknown error' (1):
> call=62,
> status=complete,
> exitreason='',
> last-rc-change='Fri Mar 26 14:23:05 2021',
> queued=0ms,
> exec=0ms
>
> Now, that second resource is one which doesn't have a standard resource agent
> available for it under /usr/lib/ocf/resource.d, so I'm using the general-
> purpose agent /usr/lib/ocf/resource.d/heartbeat/anything to manage it.
>
> I thought, "perhaps there's something dodgy about using this 'anything' agent,
> because it can't really know about the resource it's managing", so I tested it
> with ocf-tester:
>
> # ocf-tester -n Kann-Bear -o binfile="/usr/sbin/bearerbox" -o
> cmdline_options="/etc/kannel/kannel.conf" -o
> pidfile="/var/run/kannel/kannel_bearerbox.pid"
> /usr/lib/ocf/resource.d/heartbeat/anything
> Beginning tests for /usr/lib/ocf/resource.d/heartbeat/anything...
> /usr/sbin/ocf-tester: 226: /usr/sbin/ocf-tester: xmllint: not found
> * rc=127: Your agent produces meta-data which does not conform to ra-api-1.dtd
> * Your agent does not support the notify action (optional)
> * Your agent does not support the demote action (optional)
> * Your agent does not support the promote action (optional)
> * Your agent does not support master/slave (optional)
> * Your agent does not support the reload action (optional)
> Tests failed: /usr/lib/ocf/resource.d/heartbeat/anything failed 1 tests
>
> Okay, something's not right.
>
> BUT, it doesn't matter *which* resource agent I test, it tells me the same
> thing every time, including for the built-in standard agents:
>
> * rc=127: Your agent produces meta-data which does not conform to ra-api-1.dtd
>
> For example:
>
> # ocf-tester -n Asterisk /usr/lib/ocf/resource.d/heartbeat/asterisk
> Beginning tests for /usr/lib/ocf/resource.d/heartbeat/asterisk...
> /usr/sbin/ocf-tester: 226: /usr/sbin/ocf-tester: xmllint: not found
> * rc=127: Your agent produces meta-data which does not conform to ra-api-1.dtd
> * Your agent does not support the notify action (optional)
> * Your agent does not support the demote action (optional)
> * Your agent does not support the promote action (optional)
> * Your agent does not support master/slave (optional)
> * Your agent does not support the reload action (optional)
> Tests failed: /usr/lib/ocf/resource.d/heartbeat/asterisk failed 1 tests
>
>
> # ocf-tester -n IP-Float4 -o ip=10.1.0.42 -o cidr_netmask=28
> /usr/lib/ocf/resource.d/heartbeat/IPaddr2
> Beginning tests for /usr/lib/ocf/resource.d/heartbeat/IPaddr2...
> /usr/sbin/ocf-tester: 226: /usr/sbin/ocf-tester: xmllint: not found
> * rc=127: Your agent produces meta-data which does not conform to ra-api-1.dtd
> * Your agent does not support the notify action (optional)
> * Your agent does not support the demote action (optional)
> * Your agent does not support the promote action (optional)
> * Your agent does not support master/slave (optional)
> * Your agent does not support the reload action (optional)
> Tests failed: /usr/lib/ocf/resource.d/heartbeat/IPaddr2 failed 1 tests
>
>
> So, it seems to be telling me that even the standard built-in resource agents
> "produce meta-data which does not conform to ra-api-1.dtd"
>
>
> My first question is: what's going wrong here? Am I using ocf-tester
> incorrectly, or is it a bug?
>
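As the error messages make clear, ocf-tester calls xmllint, which is not
installed. rc=127 is the shell's "command not found" status, so the
reported failure says nothing about the agents' meta-data.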
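A minimal sketch of how to confirm this before re-running ocf-tester. The helper name have_tool is made up for illustration; the package name applies to Debian, where xmllint is shipped in libxml2-utils.

```shell
#!/bin/sh
# Return 0 if the named command can be found in PATH, non-zero otherwise.
# (have_tool is a hypothetical helper, not part of ocf-tester.)
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool xmllint; then
    echo "xmllint found: $(command -v xmllint)"
else
    # A missing command is what makes the shell report status 127.
    echo "xmllint not found; on Debian it is in the libxml2-utils package"
    echo "install it with: apt install libxml2-utils"
fi
```

Once xmllint is installed, the meta-data check in ocf-tester should reflect the agent itself rather than the missing tool.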
> My second question is: how can I debug what caused pacemaker to decide that it
> couldn't run Asterisk due to "insufficient privileges" on a machine which is
> perfectly well capable of running Asterisk, and including when it gets
> started by pacemaker (in fact, that's the only way Asterisk gets started on
> these machines; it's a floating resource which pacemaker is in charge of).
>
The agent returns this error if it fails to chown a directory specified in
its configuration file:

    # Regardless of whether we just created the directory or it
    # already existed, check whether it is writable by the configured
    # user
    if ! su -s /bin/sh - $OCF_RESKEY_user -c "test -w $dir"; then
        ocf_log warn "Directory $dir is not writable by $OCF_RESKEY_user, attempting chown"
        ocf_run chown $OCF_RESKEY_user:$OCF_RESKEY_group $dir \
            || exit $OCF_ERR_PERM
    fi

Either the agent is not running as root or something is blocking chown. The
usual suspects are AppArmor or SELinux.
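A rough sketch of how one might check both suspects by hand. The function names user_can_write and report_mac are made up for illustration; aa-status and getenforce are the standard AppArmor and SELinux status tools, and they may not be installed on every system.

```shell
#!/bin/sh
# Mirror the agent's test: can user $1 write to directory $2?
# Run this as root, otherwise su will ask for a password.
# (user_can_write is a hypothetical helper, not part of the agent.)
user_can_write() {
    su -s /bin/sh - "$1" -c "test -w '$2'"
}

# Report which mandatory access control tools are present.
# (report_mac is a hypothetical helper.)
report_mac() {
    found=0
    if command -v aa-status >/dev/null 2>&1; then
        echo "AppArmor tools found: run 'aa-status' as root to list enforced profiles"
        found=1
    fi
    if command -v getenforce >/dev/null 2>&1; then
        echo "SELinux mode: $(getenforce 2>/dev/null)"
        found=1
    fi
    if [ "$found" -eq 0 ]; then
        echo "neither aa-status nor getenforce found in PATH"
    fi
}

report_mac
```

As root, a call such as `user_can_write asterisk /var/run/asterisk` repeats the agent's writability test; the user name and directory here are placeholders and should be replaced with the values from your own resource definition and Asterisk configuration. If the test fails and a manual chown as root is also refused, a confinement profile is the more likely cause.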
>
> Please let me know if I can provide any further information to help work out
> what's going on here.
>
>
> Thanks,
>
>
> Antony.
>