<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Mar 26, 2021 at 2:44 PM Antony Stone <<a href="mailto:Antony.Stone@ha.open.source.it">Antony.Stone@ha.open.source.it</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Friday 26 March 2021 at 18:31:51, Ken Gaillot wrote:<br>

<br>

> On Fri, 2021-03-26 at 19:59 +0300, Andrei Borzenkov wrote:<br>

> > On 26.03.2021 17:28, Antony Stone wrote:<br>

> > > <br>

> > > So far all is well and good, my cluster synchronises, starts the<br>

> > > resources, and everything's working as expected.  It'll move the<br>

> > > resources from one cluster member to another (either if I ask it to, or<br>

> > > if there's a problem), and it seems to work just as the older version<br>

> > > did.<br>

> <br>

> I'm glad this far was easy :)<br>

<br>

Well, I've been using corosync & pacemaker for some years now; I've got used <br>

to some of their quirks and foibles :)<br>

<br>

Now I just need to learn about the new ones for the newer versions...<br>

<br>

> It's worth noting that pacemaker itself doesn't try to validate the<br>

> agent meta-data, it just checks for the pieces that are interesting to<br>

> it and ignores the rest.<br>

<br>

I guess that's good, so long as what it does pay attention to is what it wants <br>

to see?<br>

<br>

> It's also worth noting that the OCF 1.0 standard is horribly outdated<br>

> compared to actual use, and the OCF 1.1 standard is being adopted today<br>

> (!) after many years of trying to come up with something more up-to-<br>

> date.<br>

<br>

So, is ocf-tester no longer the right tool I should be using to check this <br>

sort of thing?  What should I be doing instead to make sure my configuration <br>

is valid / acceptable to pacemaker?<br>

<br>

> Bottom line, it's worth installing xmllint to see if that helps, but I<br>

> wouldn't worry about meta-data schema issues.<br>

<br>

Well, as stated in my other reply to Andrei, I now get:<br>

<br>

/usr/lib/ocf/resource.d/heartbeat/asterisk passed all tests<br>

<br>

/usr/lib/ocf/resource.d/heartbeat/anything passed all tests<br>

<br>

so I guess it means my configuration file is okay, and I need to look somewhere <br>

else to find out why pacemaker 2.0.1 is throwing wobblies with exactly the <br>

same resources that pacemaker 1.1.16 can manage quite happily and stably...<br>

<br>

> > Either agent does not run as root or something blocks chown. Usual<br>

> > suspects are apparmor or SELinux.<br>

> <br>

> Pacemaker itself can also return this error in certain cases, such as<br>

> not having permissions to execute the agent. Check the pacemaker detail<br>

> log (usually /var/log/pacemaker/pacemaker.log) and the system log<br>

> around these times to see if there is more detail.<br>

<br>

I've turned on debug logging, but I'm still not sure I'm seeing *exactly* what <br>

the resource agent checker is doing when it gets this failure.<br>

<br>

> It is definitely weird that a privileges error would be sporadic.<br>

> Hopefully the logs can shed some more light.<br>

<br>

I've captured a bunch of them this afternoon and will go through them on <br>

Monday - it's pretty verbose!<br>

<br>

> Another possibility would be to set trace_ra=1 on the actions that are<br>

> failing to get line-by-line info from the agents.<br>

<br>

So, that would be an extra parameter to the resource definition in cluster.cib?<br>

<br>

Change:<br>

<br>

primitive Asterisk asterisk meta migration-threshold=3 op monitor interval=5 <br>

timeout=30 on-fail=restart failure-timeout=10s<br>

<br>

to:<br>

<br>

primitive Asterisk asterisk meta migration-threshold=3 op monitor interval=5 <br>

timeout=30 on-fail=restart failure-timeout=10s trace_ra=1<br>

<br>

        ?<br></blockquote><div><br></div><div>It's an instance attribute, not a meta attribute. I'm not familiar with crmsh syntax but trace_ra=1 would go wherever you would configure a "normal" option, like `ip=x.x.x.x` for an IPaddr2 resource. It will save a shell trace of each operation to a file in /var/lib/heartbeat/trace_ra/asterisk. You would then wait for an operation to fail, find the file containing that operation's trace, and see what it tells you about the error.</div><div><br></div><div>You might already have some more detail about the error in /var/log/messages and/or /var/log/pacemaker/pacemaker.log. Look in /var/log/messages around Fri Mar 26 13:37:08 2021 on the node where the failure occurred. See if there are any additional messages from the resource agent, or any stdout or stderr logged by lrmd/pacemaker-execd for the Asterisk resource.<br></div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

<br>

Antony.<br>

<br>

-- <br>

"It is easy to be blinded to the essential uselessness of them by the sense of <br>

achievement you get from getting them to work at all. In other words - and <br>

this is the rock solid principle on which the whole of the Corporation's <br>

Galaxy-wide success is founded - their fundamental design flaws are completely <br>

hidden by their superficial design flaws."<br>

<br>

 - Douglas Noel Adams<br>

<br>

                                                   Please reply to the list;<br>

                                                         please *don't* CC me.<br>


</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div>Regards,<br><br></div>Reid Wahl, RHCA<br></div><div>Senior Software Maintenance Engineer, Red Hat<br></div>CEE - Platform Support Delivery - ClusterHA</div></div></div></div></div></div></div></div></div></div></div></div></div></div></div>