[ClusterLabs] ocf-tester always claims failure, even with built-in resource agents?

Fri Mar 26 18:28:32 EDT 2021

On Fri, Mar 26, 2021 at 2:44 PM Antony Stone <Antony.Stone at ha.open.source.it>
wrote:

> On Friday 26 March 2021 at 18:31:51, Ken Gaillot wrote:
>
> > On Fri, 2021-03-26 at 19:59 +0300, Andrei Borzenkov wrote:
> > > On 26.03.2021 17:28, Antony Stone wrote:
> > > >
> > > > So far all is well and good, my cluster synchronises, starts the
> > > > resources, and everything's working as expected.  It'll move the
> > > > resources from one cluster member to another (either if I ask it to,
> > > > or if there's a problem), and it seems to work just as the older
> > > > version did.
> >
> > I'm glad this far was easy :)
>
> Well, I've been using corosync & pacemaker for some years now; I've got
> used to some of their quirks and foibles :)
>
> Now I just need to learn about the new ones for the newer versions...
>
> > It's worth noting that pacemaker itself doesn't try to validate the
> > agent meta-data, it just checks for the pieces that are interesting to
> > it and ignores the rest.
>
> I guess that's good, so long as what it does pay attention to is what it
> wants to see?
>
> > It's also worth noting that the OCF 1.0 standard is horribly outdated
> > compared to actual use, and the OCF 1.1 standard is being adopted today
> > (!) after many years of trying to come up with something more up-to-
> > date.
>
> So, is ocf-tester no longer the right tool I should be using to check
> this sort of thing?  What should I be doing instead to make sure my
> configuration is valid / acceptable to pacemaker?
>
> > Bottom line, it's worth installing xmllint to see if that helps, but I
> > wouldn't worry about meta-data schema issues.
>
> Well, as stated in my other reply to Andrei, I now get:
>
> /usr/lib/ocf/resource.d/heartbeat/asterisk passed all tests
>
> /usr/lib/ocf/resource.d/heartbeat/anything passed all tests
>
> so I guess it means my configuration file is okay, and I need to look
> somewhere else to find out why pacemaker 2.0.1 is throwing wobblies with
> exactly the same resources that pacemaker 1.1.16 can manage quite happily
> and stably...
>
> > > Either agent does not run as root or something blocks chown. Usual
> > > suspects are apparmor or SELinux.
> >
> > Pacemaker itself can also return this error in certain cases, such as
> > not having permissions to execute the agent. Check the pacemaker detail
> > log (usually /var/log/pacemaker/pacemaker.log) and the system log
> > around these times to see if there is more detail.
>
> I've turned on debug logging, but I'm still not sure I'm seeing *exactly*
> what the resource agent checker is doing when it gets this failure.
>
> > It is definitely weird that a privileges error would be sporadic.
> > Hopefully the logs can shed some more light.
>
> I've captured a bunch of them this afternoon and will go through them on
> Monday - it's pretty verbose!
>
> > Another possibility would be to set trace_ra=1 on the actions that are
> > failing to get line-by-line info from the agents.
>
> So, that would be an extra parameter to the resource definition in
> cluster.cib?
>
> Change:
>
> primitive Asterisk asterisk meta migration-threshold=3 op monitor interval=5 timeout=30 on-fail=restart failure-timeout=10s
>
> to:
>
> primitive Asterisk asterisk meta migration-threshold=3 op monitor interval=5 timeout=30 on-fail=restart failure-timeout=10s trace_ra=1
>
>         ?
>

trace_ra is an instance attribute, not a meta attribute. I'm not familiar with
crmsh syntax but trace_ra=1 would go wherever you would configure a
"normal" option, like `ip=x.x.x.x` for an IPaddr2 resource. It will save a
shell trace of each operation to a file in
/var/lib/heartbeat/trace_ra/asterisk. You would then wait for an operation
to fail, find the file containing that operation's trace, and see what it
tells you about the error.
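
As a rough sketch of the "find the file" step, something like the following
Python would list the most recently written files in the trace directory
mentioned above. The helper name `newest_traces` is made up for illustration,
and the sketch assumes the newest files are the ones you want; it makes no
assumption about how the trace files are named.

```python
from pathlib import Path

# Directory named in the message above; adjust if your traces land elsewhere.
TRACE_DIR = Path("/var/lib/heartbeat/trace_ra/asterisk")


def newest_traces(trace_dir, limit=5):
    """Return up to `limit` regular files in trace_dir, newest first.

    Hypothetical helper: returns an empty list if the directory is missing.
    """
    trace_dir = Path(trace_dir)
    if not trace_dir.is_dir():
        return []
    files = [p for p in trace_dir.iterdir() if p.is_file()]
    # Sort by modification time so the latest operation's trace comes first.
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:limit]


for path in newest_traces(TRACE_DIR):
    print(path)
```

Run it as a user that can read the trace directory, shortly after the failure
shows up, and open the files it prints.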

You might already have some more detail about the error in
/var/log/messages and/or /var/log/pacemaker/pacemaker.log. Look in
/var/log/messages around Fri Mar 26 13:37:08 2021 on the node where the
failure occurred. See if there are any additional messages from the
resource agent, or any stdout or stderr logged by lrmd/pacemaker-execd for
the Asterisk resource.
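
If the logs are large, a small filter can narrow them to the minute or so
around the failure. This is only a sketch: it assumes each line begins with a
traditional syslog timestamp such as "Mar 26 13:37:08" (which carries no
year, so the year is passed in), and `lines_near` and `show` are hypothetical
helper names, not part of any pacemaker tooling.

```python
from datetime import datetime, timedelta


def lines_near(lines, when, window_seconds=60, year=2021):
    """Yield lines whose leading timestamp is within the window around `when`.

    Assumes a 15-character syslog-style prefix, e.g. "Mar 26 13:37:08".
    """
    window = timedelta(seconds=window_seconds)
    for line in lines:
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # no parseable timestamp at the start of this line
        if abs(stamp - when) <= window:
            yield line


def show(path, when):
    """Print the matching lines from one log file."""
    with open(path, errors="replace") as log:
        for line in lines_near(log, when):
            print(line, end="")


# Example call (not executed here):
# show("/var/log/messages", datetime(2021, 3, 26, 13, 37, 8))
```

The same call can be pointed at /var/log/pacemaker/pacemaker.log if its lines
start with the same timestamp format on your system.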

>
> Antony.
>
> --
> "It is easy to be blinded to the essential uselessness of them by the
> sense of achievement you get from getting them to work at all. In other
> words - and this is the rock solid principle on which the whole of the
> Corporation's Galaxy-wide success is founded - their fundamental design
> flaws are completely hidden by their superficial design flaws."
>
>  - Douglas Noel Adams
>
>                                             Please reply to the list;
>                                             please *don't* CC me.

-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA