<div dir="ltr">Yep, that works fine. Thanks for the explanation.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Apr 18, 2019 at 5:00 PM Ken Gaillot <<a href="mailto:kgaillot@redhat.com">kgaillot@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Thu, 2019-04-18 at 15:51 -0600, JCA wrote:<br>

> I have my CentOS two-node cluster, which some of you may already be<br>

> sick and tired of reading about:<br>

> <br>

> # pcs status<br>

> Cluster name: FirstCluster<br>

> Stack: corosync<br>

> Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition<br>

> with quorum<br>

> Last updated: Thu Apr 18 13:52:38 2019<br>

> Last change: Thu Apr 18 13:50:57 2019 by root via cibadmin on one<br>

> <br>

> 2 nodes configured<br>

> 5 resources configured<br>

> <br>

> Online: [ one two ]<br>

> <br>

> Full list of resources:<br>

> <br>

>  MyCluster  (ocf::myapp:myapp-script):  Started two<br>

>  Master/Slave Set: DrbdDataClone [DrbdData]<br>

>      Masters: [ two ]<br>

>      Slaves: [ one ]<br>

>  DrbdFS (ocf::heartbeat:Filesystem):  Started two<br>

>  disk_fencing (stonith:fence_scsi): Started one<br>

> <br>

> Daemon Status:<br>

>   corosync: active/enabled<br>

>   pacemaker: active/enabled<br>

>   pcsd: active/enabled<br>

> <br>

> I can stop either node, and the other will take over as expected.<br>

> Here is the thing though:<br>

> <br>

> myapp-script starts, stops and monitors the actual application that I<br>

> am interested in. I'll call this application A. At the OS level, A is<br>

> of course listed when I do ps auwx.<br>

> <br>

> In the situation above, where A is running on two, I can kill A from<br>

> the CentOS command line on two. Shortly after doing so, Pacemaker<br>

> invokes myapp-script on two with the following actions, which return<br>

> the following values:<br>

> <br>

>    monitor: OCF_NOT_RUNNING<br>

>    stop: OCF_SUCCESS<br>

>    start: OCF_SUCCESS <br>

>    monitor: OCF_SUCCESS<br>

>  <br>

> After this, with ps auwx on two I can see that A is indeed up and<br>

> running. However, the output from pcs status (on either one or two)<br>

> is now the following:<br>

> <br>

> Cluster name: FirstCluster<br>

> Stack: corosync<br>

> Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition<br>

> with quorum<br>

> Last updated: Thu Apr 18 15:21:25 2019<br>

> Last change: Thu Apr 18 13:50:57 2019 by root via cibadmin on one<br>

> <br>

> 2 nodes configured<br>

> 5 resources configured<br>

> <br>

> Online: [ one two ]<br>

> <br>

> Full list of resources:<br>

> <br>

>  MyCluster  (ocf::myapp:myapp-script):  Started two<br>

>  Master/Slave Set: DrbdDataClone [DrbdData]<br>

>      Masters: [ two ]<br>

>      Slaves: [ one ]<br>

>  DrbdFS (ocf::heartbeat:Filesystem):  Started two<br>

>  disk_fencing (stonith:fence_scsi): Started one<br>

> <br>

> Failed Actions:<br>

> * MyCluster_monitor_30000 on two 'not running' (7): call=35,<br>

> status=complete, exitreason='',<br>

>     last-rc-change='Thu Apr 18 15:21:12 2019', queued=0ms, exec=0ms<br>

> <br>

> <br>

> Daemon Status:<br>

>   corosync: active/enabled<br>

>   pacemaker: active/enabled<br>

>   pcsd: active/enabled<br>

> <br>

> And the cluster seems to stay stuck there, until I stop and start<br>

> node two explicitly.<br>

> <br>

>        Is this the expected behavior? What I was expecting is for<br>

> Pacemaker to restart A, in either node - which it indeed does, in two<br>

> itself. But pcs status seems to think that an error happened when<br>

> trying to restart A - despite the fact that it got A restarted all<br>

> right. And I know that A is running correctly to boot.<br>

> <br>

>        What am I misunderstanding here?<br>

<br>

You got everything right, except the display is not saying the restart<br>

failed -- it's saying there was a monitor failure that led to the<br>

restart. The "Failed Actions" section is a history rather than the<br>

current status (which is the "Full list of resources" section).<br>
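<br>

For illustration, here is a minimal sketch of a monitor action that would<br>

produce the 'not running' (7) result recorded above. It assumes a<br>

hypothetical agent that tracks A through a PID file; the function name and<br>

the PID-file argument are illustrative and not taken from myapp-script.<br>

```shell
#!/bin/sh
# OCF return codes. A real agent normally gets these by sourcing
# ocf-shellfuncs; they are defined here so the sketch is self-contained.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

# Hypothetical monitor action: report whether the process recorded in a
# PID file is alive. $1 is the PID file path (an assumption for this sketch).
myapp_monitor() {
    pidfile="$1"
    # No readable PID file: treat the application as not running.
    [ -r "$pidfile" ] || return "$OCF_NOT_RUNNING"
    pid=$(cat "$pidfile")
    # kill -0 sends no signal; it only checks that the process exists.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        return "$OCF_SUCCESS"
    fi
    return "$OCF_NOT_RUNNING"
}
```

When a recurring monitor returns 7 for a resource the cluster believes is<br>

started, Pacemaker records the failure and then recovers the resource, which<br>

matches the stop, start, monitor sequence described in the question.<br>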

<br>

The idea is that failures might occur when you're not looking :) and<br>

you can see that they happened the next time you check the status, even<br>

if the cluster was able to recover successfully.<br>

<br>

To clear the history, run "crm_resource -C -r MyCluster" (or "pcs<br>

resource cleanup MyCluster" if you're using pcs).<br>
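<br>

As a rough sketch, the failure history can be pulled out of the status<br>

output before clearing it. The helper name failed_actions is hypothetical,<br>

and the filter assumes the layout shown in this thread, where the "Failed<br>

Actions:" section is followed by "Daemon Status:".<br>

```shell
#!/bin/sh
# Print only the entries of the "Failed Actions:" section from pcs status
# output read on stdin. Assumes the section ends at "Daemon Status:".
failed_actions() {
    awk '/^Failed Actions:/ {show=1; next}
         /^Daemon Status:/  {show=0}
         show && NF'
}

# Typical use on a cluster node (not executed here):
#   pcs status | failed_actions
#   pcs resource cleanup MyCluster
```

After the cleanup, the "Failed Actions" section should no longer list the<br>

old monitor failure, while the resource itself keeps running.<br>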

-- <br>

Ken Gaillot <<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>><br>

<br>


</blockquote></div>