<div dir="ltr"><div>Thanks. See my comments interspersed below.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Apr 15, 2019 at 4:30 PM Ken Gaillot <<a href="mailto:kgaillot@redhat.com">kgaillot@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Mon, 2019-04-15 at 14:15 -0600, JCA wrote:<br>

> I have a simple two-node cluster, node one and node two, with a<br>

> single resource, ClusterMyApp. The nodes are CentOS 7 VMs. The<br>

> resource is created executing the following line in node one:<br>

> <br>

>    # pcs resource create ClusterMyApp ocf:myapp:myapp-script op<br>

> monitor interval=30s<br>

<br>

FYI, it doesn't matter which node you change the configuration on --<br>

it's automatically sync'd across all nodes.<br>

<br></blockquote><div> </div><div>   OK. </div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

> This invokes myapp-script, which I installed under<br>

> /usr/lib/ocf/resource.d/myapp/myapp-script, both in one and two -<br>

> i.e. it is exactly the same script in both nodes. <br>

> <br>

> On executing the command above in node one, I get the following log<br>

> entries in node one itself:<br>

> <br>

> Apr 15 13:40:12 one crmd[13670]:  notice: Result of probe operation<br>

> for ClusterMyApp on one: 7 (not running)<br>

> Apr 15 13:40:12 one crmd[13670]:  notice: Result of start operation<br>

> for ClusterMyApp on one: 0 (ok)<br>

<br>

Whenever starting the cluster on a node, or adding a resource,<br>

pacemaker probes the state of all resources on the node (or all nodes<br>

in the case of adding a resource) by calling the agent's "monitor"<br>

action once. You'll see this "Result of probe operation" for all<br>

resources on all nodes.<br>

<br>

This allows pacemaker to detect if and where a service is already<br>

running.<br>

<br></blockquote><div> </div><div>   OK.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

> This is in line with what I expect from myapp-script when invoked<br>

> with the 'start' option (which is what the command above is doing.)<br>

> myapp-script first checks out whether my app is running, and if it is<br>

> not then launches it. The rest of the log entries are to do with my<br>

> app, indicating that it started without any problems.<br>

> <br>

> In node two, when the command above is executed in one, the following<br>

> log entries are generated:<br>

> <br>

> Apr 15 13:40:12 two crmd[4293]:  notice: State transition S_IDLE -><br>

> S_POLICY_ENGINE<br>

> Apr 15 13:40:12 two pengine[4292]:  notice:  * Start     <br>

> ClusterMyApp     (one )<br>

> Apr 15 13:40:12 two pengine[4292]:  notice: Calculated transition 16,<br>

> saving inputs in /var/lib/pacemaker/pengine/pe-input-66.bz2<br>

<br>

At any given time, one node in the cluster is elected the "Designated<br>

Controller" (DC). That node will calculate what (if anything) needs to<br>

be done, and tell the right nodes to do it. Above, it has determined<br>

that ClusterMyApp needs to be started on node one.<br></blockquote><div><br></div><div>  OK.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

> Apr 15 13:40:12 two crmd[4293]:  notice: Initiating monitor operation<br>

> ClusterMyApp_monitor_0 locally on two<br>

> Apr 15 13:40:12 two crmd[4293]:  notice: Initiating monitor operation<br>

> ClusterMyApp_monitor_0 on one<br>

<br>

The cluster first probes the current state of the service on both<br>

nodes, before any actions have been taken. The expected result is "not<br>

running".<br></blockquote><div><br></div><div>   OK.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

> Apr 15 13:40:12 two crmd[4293]:  notice: Result of probe operation<br>

> for ClusterMyApp on two: 7 (not running)<br>

> Apr 15 13:40:12 two crmd[4293]:  notice: Initiating start operation<br>

> ClusterMyApp_start_0 on one<br>

> Apr 15 13:40:12 two crmd[4293]:  notice: Initiating monitor operation<br>

> ClusterMyApp_monitor_30000 on one<br>

> Apr 15 13:40:12 two crmd[4293]: warning: Action 4<br>

> (ClusterMyApp_monitor_30000) on one failed (target: 0 vs. rc: 7):<br>

> Error<br>

> Apr 15 13:40:12 two crmd[4293]:  notice: Transition aborted by<br>

> operation ClusterMyApp_monitor_30000 'create' on one: Event failed<br>

<br>

The cluster successfully probed the service on both nodes, and started<br>

it on node one. It then tried to start a 30-second recurring monitor<br>

for the service, but the monitor immediately failed (the expected<br>

result was running, but the monitor said it was not running).<br></blockquote><div><br></div><div>    It failed where? On node one, I know for a fact that my app is running, as reported by ps. I also know that it started correctly and is sitting there waiting for work to do - it depends on timers and external events. On node two, of course, it is not running.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

> After doing all of the above, pcs status returns the following, when<br>

> invoked in either node:<br>

> <br>

> Cluster name: MyCluster<br>

> Stack: corosync<br>

> Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition<br>

> with quorum<br>

> Last updated: Mon Apr 15 13:45:14 2019<br>

> Last change: Mon Apr 15 13:40:11 2019 by root via cibadmin on one<br>

> <br>

> 2 nodes configured<br>

> 1 resource configured<br>

> <br>

> Online: [ one two ]<br>

> <br>

> Full list of resources:<br>

> <br>

>  ClusterMyApp (ocf::myapp:myapp-script):      Started one<br>

> <br>

> Failed Actions:<br>

> * ClusterMyApp_monitor_30000 on one 'not running' (7): call=37,<br>

> status=complete, exitreason='',<br>

>     last-rc-change='Mon Apr 15 13:40:12 2019', queued=0ms, exec=105ms<br>

> <br>

> <br>

> Daemon Status:<br>

>   corosync: active/disabled<br>

>   pacemaker: active/disabled<br>

>   pcsd: active/enabled<br>

> <br>

> The start function in this script is as follows:<br>

> <br>

> myapp_start() {<br>

>   myapp_conf_check<br>

>   local diagnostic=$?<br>

> <br>

>   if [ $diagnostic -ne $OCF_SUCCESS ]; then<br>

>     return $diagnostic<br>

>   fi<br>

> <br>

>   myapp_monitor<br>

> <br>

>   local state=$?<br>

> <br>

>   case $state in<br>

>     $OCF_SUCCESS)<br>

>         return $OCF_SUCCESS<br>

>       ;;<br>

> <br>

>     $OCF_NOT_RUNNING)<br>

>       myapp_launch > /dev/null 2>&1<br>

>       if [ $?  -eq 0 ]; then<br>

>         return $OCF_SUCCESS<br>

>       fi<br>

> <br>

>       return $OCF_ERR_GENERIC<br>

>       ;;<br>

> <br>

>     *)<br>

>       return $state<br>

>       ;;<br>

>   esac<br>

> }<br>

> <br>

> I know for a fact that, in one, myapp_launch gets invoked, and that<br>

> its exit value is 0. The function therefore returns OCF_SUCCESS, as<br>

> it should. However, if I understand things correctly, the log entries<br>

> in two seem to claim that the exit value of the script in one is<br>

> OCF_NOT_RUNNING. <br>

<br>

The start succeeded. It's the recurring monitor that failed.<br></blockquote><div><br></div><div>  I assume that the recurring monitor invokes the myapp_monitor function that I created at the same time as myapp_start. As far as I can tell, and as I mentioned in later posts, the problem seems to be that myapp_start gets invoked several times when creating the resource. As a result - although I have yet to understand it in detail - myapp_monitor occasionally fails in that sequence. Whether the resource starts correctly or not depends on what happens the last time myapp_monitor is invoked when creating the resource - which is why the whole thing works only every so often. </div><div><br></div><div>  Why is myapp_start invoked so many times when the resource is created? Who or what controls this? Is it configurable?</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>
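<div>To see which actions are actually being requested while the resource is created, a trace line at the top of the agent's dispatcher can help. An OCF agent is executed once per action, with the action name as its first argument (the logs above show monitor for the probe, start, then monitor again for the recurring check). The following is only a minimal sketch: agent_main and the log path are hypothetical names, and the loop at the bottom merely imitates that sequence of calls.</div>

```shell
#!/bin/sh
# Sketch only: agent_main and LOG are hypothetical, not taken from myapp-script.
LOG="${TMPDIR:-/tmp}/myapp-trace.$$.log"

agent_main() {
  action="$1"
  # Record every invocation so the real sequence of actions can be read back later.
  echo "$(date '+%H:%M:%S') action=$action" >> "$LOG"
  case "$action" in
    start|stop|monitor|meta-data|validate-all) return 0 ;;
    *) return 3 ;;  # 3 is OCF_ERR_UNIMPLEMENTED
  esac
}

# Imitate the calls seen in the logs: probe, start, recurring monitor.
for a in monitor start monitor; do
  agent_main "$a"
done

calls=$(grep -c 'action=' "$LOG")
monitors=$(grep -c 'action=monitor' "$LOG")
echo "calls=$calls monitors=$monitors"
rm -f "$LOG"
```

<div>In a real agent the same echo line would go at the very top of the script, before the action is dispatched, so that calls made internally (such as myapp_start calling myapp_monitor) can be told apart from calls made by the cluster.</div>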

> <br>

> What's going on here? It's obviously something to do with myapp-<br>

> script - but, what? <br>

-- <br>

Ken Gaillot <<a href="mailto:kgaillot@redhat.com" target="_top">kgaillot@redhat.com</a>><br>

<br>


</blockquote></div></div>
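<div>One pattern that may help when a recurring monitor fails immediately after a successful start is to have the start action wait until the monitor itself reports success before returning. This is offered as a sketch, not as a diagnosis of myapp-script: the marker file, the delayed launcher and the retry cap below are hypothetical stand-ins for whatever the real script uses.</div>

```shell
#!/bin/sh
# Sketch only: these definitions stand in for the real agent's pieces.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

MARKER="${TMPDIR:-/tmp}/myapp-sketch.$$.ready"  # hypothetical "app is up" marker

# Hypothetical launcher: returns 0 at once, but the app only becomes
# detectable about a second later, as a slow-starting daemon might.
myapp_launch() {
  ( sleep 1; : > "$MARKER" ) &
  return 0
}

# Hypothetical monitor: running if and only if the marker exists.
myapp_monitor() {
  [ -f "$MARKER" ] && return $OCF_SUCCESS
  return $OCF_NOT_RUNNING
}

myapp_start() {
  # Already running: nothing to do.
  myapp_monitor && return $OCF_SUCCESS
  myapp_launch || return $OCF_ERR_GENERIC
  # Do not report success until the monitor agrees the app is up.
  tries=0
  while ! myapp_monitor; do
    tries=$((tries + 1))
    [ "$tries" -ge 10 ] && return $OCF_ERR_GENERIC
    sleep 1
  done
  return $OCF_SUCCESS
}

myapp_start;   start_rc=$?
myapp_monitor; monitor_rc=$?
echo "start=$start_rc monitor=$monitor_rc"
rm -f "$MARKER"
```

<div>With this shape, a monitor run straight after start sees the same state that start itself waited for. Without the loop, start would return 0 while the first monitor could still return 7, which is the combination shown in the logs above.</div>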