[ClusterLabs] ocf-tester always claims failure, even with built-in resource agents?
Antony Stone
Antony.Stone at ha.open.source.it
Fri Mar 26 17:43:59 EDT 2021
On Friday 26 March 2021 at 18:31:51, Ken Gaillot wrote:
> On Fri, 2021-03-26 at 19:59 +0300, Andrei Borzenkov wrote:
> > On 26.03.2021 17:28, Antony Stone wrote:
> > >
> > > So far all is well and good, my cluster synchronises, starts the
> > > resources, and everything's working as expected. It'll move the
> > > resources from one cluster member to another (either if I ask it to, or
> > > if there's a problem), and it seems to work just as the older version
> > > did.
>
> I'm glad this far was easy :)
Well, I've been using corosync & pacemaker for some years now; I've got used
to some of their quirks and foibles :)
Now I just need to learn about the new ones for the newer versions...
> It's worth noting that pacemaker itself doesn't try to validate the
> agent meta-data, it just checks for the pieces that are interesting to
> it and ignores the rest.
I guess that's good, so long as what it does pay attention to is what it wants
to see?
> It's also worth noting that the OCF 1.0 standard is horribly outdated
> compared to actual use, and the OCF 1.1 standard is being adopted today
> (!) after many years of trying to come up with something more up-to-
> date.
So, is ocf-tester no longer the right tool to be using to check this
sort of thing? What should I be doing instead to make sure my configuration
is valid / acceptable to pacemaker?
> Bottom line, it's worth installing xmllint to see if that helps, but I
> wouldn't worry about meta-data schema issues.
Well, as stated in my other reply to Andrei, I now get:
/usr/lib/ocf/resource.d/heartbeat/asterisk passed all tests
/usr/lib/ocf/resource.d/heartbeat/anything passed all tests
so I guess it means my configuration file is okay, and I need to look somewhere
else to find out why pacemaker 2.0.1 is throwing wobblies with exactly the
same resources that pacemaker 1.1.16 can manage quite happily and stably...
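For reference, a minimal sketch of a meta-data sanity check that does not depend on ocf-tester. It only tests that the XML is well-formed and that a few actions are declared; it is not a DTD or schema validation, so it is weaker than what xmllint can do. The function names, the list of actions checked, and the OCF_ROOT value (inferred from the agent paths above) are assumptions.

```python
import os
import subprocess
import xml.etree.ElementTree as ET

# Actions this sketch expects an agent to advertise (an assumption, not a schema).
EXPECTED_ACTIONS = ("start", "stop", "monitor", "meta-data")


def check_metadata(xml_text):
    """Return (ok, detail) for the XML printed by an agent's meta-data action."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return False, f"not well-formed: {exc}"
    if root.tag != "resource-agent":
        return False, f"unexpected root element <{root.tag}>"
    declared = [action.get("name") for action in root.iter("action")]
    missing = [name for name in EXPECTED_ACTIONS if name not in declared]
    if missing:
        return False, "missing actions: " + ", ".join(missing)
    return True, "ok"


def fetch_metadata(agent_path, ocf_root="/usr/lib/ocf"):
    """Run the agent with the meta-data action and return what it prints."""
    env = dict(os.environ, OCF_ROOT=ocf_root)  # assumed location of the OCF tree
    result = subprocess.run(
        [agent_path, "meta-data"],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Something like `check_metadata(fetch_metadata("/usr/lib/ocf/resource.d/heartbeat/asterisk"))` would then report whether the agent's meta-data parses at all, which is the part that matters before worrying about schema details.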
> > Either agent does not run as root or something blocks chown. Usual
> > suspects are apparmor or SELinux.
>
> Pacemaker itself can also return this error in certain cases, such as
> not having permissions to execute the agent. Check the pacemaker detail
> log (usually /var/log/pacemaker/pacemaker.log) and the system log
> around these times to see if there is more detail.
I've turned on debug logging, but I'm still not sure I'm seeing *exactly* what
the resource agent checker is doing when it gets this failure.
> It is definitely weird that a privileges error would be sporadic.
> Hopefully the logs can shed some more light.
I've captured a bunch of them this afternoon and will go through them on
Monday - it's pretty verbose!
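One way to cut verbose debug output down to size is to keep only the lines that mention the resource in question and also look like a problem. The following is a rough sketch, assuming the log location Ken mentions above; the keyword list and the function name are guesses at what is useful, not anything pacemaker defines.

```python
import re

# Words that usually mark a line worth reading (an assumed, adjustable list).
DEFAULT_KEYWORDS = ("error", "warning", "failed", "denied", "insufficient")


def filter_log(lines, resource, keywords=DEFAULT_KEYWORDS):
    """Return the lines that mention `resource` and at least one keyword."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [
        line.rstrip("\n")
        for line in lines
        if resource in line and pattern.search(line)
    ]


def filter_log_file(path, resource, keywords=DEFAULT_KEYWORDS):
    """Apply filter_log to a log file, tolerating odd bytes in the log."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return filter_log(handle, resource, keywords)
```

For example, `filter_log_file("/var/log/pacemaker/pacemaker.log", "Asterisk")` would list only the Asterisk-related lines containing one of the keywords, which can then be matched by timestamp against the system log.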
> Another possibility would be to set trace_ra=1 on the actions that are
> failing to get line-by-line info from the agents.
So, that would be an extra parameter to the resource definition in cluster.cib?
Change:
primitive Asterisk asterisk meta migration-threshold=3 op monitor interval=5 timeout=30 on-fail=restart failure-timeout=10s
to:
primitive Asterisk asterisk meta migration-threshold=3 op monitor interval=5 timeout=30 on-fail=restart failure-timeout=10s trace_ra=1
?
Antony.
--
"It is easy to be blinded to the essential uselessness of them by the sense of
achievement you get from getting them to work at all. In other words - and
this is the rock solid principle on which the whole of the Corporation's
Galaxy-wide success is founded - their fundamental design flaws are completely
hidden by their superficial design flaws."
- Douglas Noel Adams
Please reply to the list;
please *don't* CC me.