[ClusterLabs] ocf-tester always claims failure, even with built-in resource agents?

Fri Mar 26 17:43:59 EDT 2021

On Friday 26 March 2021 at 18:31:51, Ken Gaillot wrote:

> On Fri, 2021-03-26 at 19:59 +0300, Andrei Borzenkov wrote:
> > On 26.03.2021 17:28, Antony Stone wrote:
> > > 
> > > So far all is well and good, my cluster synchronises, starts the
> > > resources, and everything's working as expected.  It'll move the
> > > resources from one cluster member to another (either if I ask it to, or
> > > if there's a problem), and it seems to work just as the older version
> > > did.
> 
> I'm glad this far was easy :)

Well, I've been using corosync & pacemaker for some years now; I've got used 
to some of their quirks and foibles :)

Now I just need to learn about the new ones for the newer versions...

> It's worth noting that pacemaker itself doesn't try to validate the
> agent meta-data, it just checks for the pieces that are interesting to
> it and ignores the rest.

I guess that's good, so long as what it does pay attention to is what it wants 
to see?

> It's also worth noting that the OCF 1.0 standard is horribly outdated
> compared to actual use, and the OCF 1.1 standard is being adopted today
> (!) after many years of trying to come up with something more up-to-
> date.

So, is ocf-tester no longer the right tool I should be using to check this 
sort of thing?  What should I be doing instead to make sure my configuration 
is valid / acceptable to pacemaker?

> Bottom line, it's worth installing xmllint to see if that helps, but I
> wouldn't worry about meta-data schema issues.

Well, as stated in my other reply to Andrei, I now get:

/usr/lib/ocf/resource.d/heartbeat/asterisk passed all tests

/usr/lib/ocf/resource.d/heartbeat/anything passed all tests

so I guess it means my configuration file is okay, and I need to look somewhere 
else to find out why pacemaker 2.0.1 is throwing wobblies with exactly the 
same resources that pacemaker 1.1.16 can manage quite happily and stably...

> > Either agent does not run as root or something blocks chown. Usual
> > suspects are apparmor or SELinux.
> 
> Pacemaker itself can also return this error in certain cases, such as
> not having permissions to execute the agent. Check the pacemaker detail
> log (usually /var/log/pacemaker/pacemaker.log) and the system log
> around these times to see if there is more detail.

I've turned on debug logging, but I'm still not sure I'm seeing *exactly* what 
the resource agent checker is doing when it gets this failure.

> It is definitely weird that a privileges error would be sporadic.
> Hopefully the logs can shed some more light.

I've captured a bunch of them this afternoon and will go through them on 
Monday - it's pretty verbose!

> Another possibility would be to set trace_ra=1 on the actions that are
> failing to get line-by-line info from the agents.

So, that would be an extra parameter to the resource definition in cluster.cib?

Change:

primitive Asterisk asterisk meta migration-threshold=3 op monitor interval=5 
timeout=30 on-fail=restart failure-timeout=10s

to:

primitive Asterisk asterisk meta migration-threshold=3 op monitor interval=5 
timeout=30 on-fail=restart failure-timeout=10s trace_ra=1

	?

Antony.

-- 
"It is easy to be blinded to the essential uselessness of them by the sense of 
achievement you get from getting them to work at all. In other words - and 
this is the rock solid principle on which the whole of the Corporation's 
Galaxy-wide success is founded - their fundamental design flaws are completely 
hidden by their superficial design flaws."

 - Douglas Noel Adams

                                                   Please reply to the list;
                                                         please *don't* CC me.