<div dir="ltr"><div dir="ltr"><div dir="ltr">I have my CentOS two-node cluster, which some of you may already be sick and tired of reading about:<div><br></div><div># pcs status</div><div><div>Cluster name: FirstCluster</div><div>Stack: corosync</div><div>Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum</div><div>Last updated: Thu Apr 18 13:52:38 2019</div><div>Last change: Thu Apr 18 13:50:57 2019 by root via cibadmin on one</div><div><br></div><div>2 nodes configured</div><div>5 resources configured</div><div><br></div><div>Online: [ one two ]</div><div><br></div><div>Full list of resources:</div><div><br></div><div> MyCluster  (ocf::myapp:myapp-script):  Started two</div><div> Master/Slave Set: DrbdDataClone [DrbdData]</div><div>     Masters: [ two ]</div><div>     Slaves: [ one ]</div><div> DrbdFS (ocf::heartbeat:Filesystem):  Started two</div><div> disk_fencing (stonith:fence_scsi): Started one</div><div><br></div><div>Daemon Status:</div><div>  corosync: active/enabled</div><div>  pacemaker: active/enabled</div><div>  pcsd: active/enabled</div></div><div><br></div><div>I can stop either node, and the other will take over as expected. Here is the thing though:</div><div><br></div><div>myapp-script starts, stops and monitors the actual application that I am interested in. I'll call this application A. At the OS level, A is of course listed when I run ps auwx. </div><div><br></div><div>In the situation above, where A is running on two, I can kill A from the CentOS command line on two. Shortly after I do so, Pacemaker invokes myapp-script on two with the following actions, which return the following values:</div><div><br></div><div>   monitor: OCF_NOT_RUNNING</div><div>   stop: OCF_SUCCESS</div><div>   start: OCF_SUCCESS </div><div>   monitor: OCF_SUCCESS</div><div> </div><div>After this, ps auwx on two shows that A is indeed up and running. 
However, the output from pcs status (on either one or two) is now the following:</div><div><br></div><div><div>Cluster name: FirstCluster</div><div>Stack: corosync</div><div>Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum</div><div>Last updated: Thu Apr 18 15:21:25 2019</div><div>Last change: Thu Apr 18 13:50:57 2019 by root via cibadmin on one</div><div><br></div><div>2 nodes configured</div><div>5 resources configured</div><div><br></div><div>Online: [ one two ]</div><div><br></div><div>Full list of resources:</div><div><br></div><div> MyCluster  (ocf::myapp:myapp-script):  Started two</div><div> Master/Slave Set: DrbdDataClone [DrbdData]</div><div>     Masters: [ two ]</div><div>     Slaves: [ one ]</div><div> DrbdFS (ocf::heartbeat:Filesystem):  Started two</div><div> disk_fencing (stonith:fence_scsi): Started one</div><div><br></div><div>Failed Actions:</div><div>* MyCluster_monitor_30000 on two 'not running' (7): call=35, status=complete, exitreason='',</div><div>    last-rc-change='Thu Apr 18 15:21:12 2019', queued=0ms, exec=0ms</div><div><br></div><div><br></div><div>Daemon Status:</div><div>  corosync: active/enabled</div><div>  pacemaker: active/enabled</div><div>  pcsd: active/enabled</div></div><div><br></div><div>And the cluster seems to stay stuck in that state until I explicitly stop and start node two.</div><div><br></div><div>Is this the expected behavior? I was expecting Pacemaker to restart A on either node, which it indeed does, on two itself. But pcs status seems to think that an error happened when trying to restart A, despite the fact that it got A restarted all right. And I know that A is running correctly to boot.</div><div><br></div><div>What am I misunderstanding here?</div><div><br></div></div></div></div>