[ClusterLabs] Failover event not reported correctly?
JCA
1.41421 at gmail.com
Thu Apr 18 17:51:37 EDT 2019
I have my CentOS two-node cluster, which some of you may already be sick
and tired of reading about:
# pcs status
Cluster name: FirstCluster
Stack: corosync
Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Thu Apr 18 13:52:38 2019
Last change: Thu Apr 18 13:50:57 2019 by root via cibadmin on one
2 nodes configured
5 resources configured
Online: [ one two ]
Full list of resources:
MyCluster (ocf::myapp:myapp-script): Started two
Master/Slave Set: DrbdDataClone [DrbdData]
Masters: [ two ]
Slaves: [ one ]
DrbdFS (ocf::heartbeat:Filesystem): Started two
disk_fencing (stonith:fence_scsi): Started one
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
I can stop either node, and the other will take over as expected. Here is
the thing though:
myapp-script starts, stops and monitors the actual application that I am
interested in. I'll call this application A. At the OS level, A is of
course listed when I do ps awux.
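For context, myapp-script has the shape of a standard OCF agent. A minimal
sketch of that shape (the binary path, pidfile and process handling below are
placeholders, not the real thing):

#!/bin/sh
# Stripped-down OCF-style agent sketch; paths and process handling are placeholders.
# Pacemaker calls the agent with one action argument: start, stop, monitor, ...
# (a real agent also answers meta-data and validate-all)

: ${OCF_ROOT:=/usr/lib/ocf}
. ${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs   # defines OCF_SUCCESS, OCF_NOT_RUNNING, ...

APP_BIN=/usr/local/bin/A     # placeholder for the real application A
PIDFILE=/var/run/A.pid       # placeholder

monitor() {
    # OCF_SUCCESS (0) if A is running, OCF_NOT_RUNNING (7) if it is not.
    [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null && return $OCF_SUCCESS
    return $OCF_NOT_RUNNING
}

start() {
    monitor && return $OCF_SUCCESS           # already running
    "$APP_BIN" &
    echo $! > "$PIDFILE"
    return $OCF_SUCCESS
}

stop() {
    monitor || return $OCF_SUCCESS           # already stopped counts as a successful stop
    kill "$(cat "$PIDFILE")"
    rm -f "$PIDFILE"
    return $OCF_SUCCESS
}

case "$1" in
    start)   start ;;
    stop)    stop ;;
    monitor) monitor ;;
    *)       exit $OCF_ERR_UNIMPLEMENTED ;;
esac
exit $?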
In the situation above, where A is running on two, I can kill A from the
CentOS command line on two. Shortly after doing so, Pacemaker invokes
myapp-script on two with the following actions, which return the following
values:
monitor: OCF_NOT_RUNNING
stop: OCF_SUCCESS
start: OCF_SUCCESS
monitor: OCF_SUCCESS
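For reference, I assume the failed monitor gets recorded per resource and per
node, and that the record can be inspected with something like this (resource
name taken from the status output above):

crm_mon -1 -f                           # one-shot cluster status including fail counts
pcs resource failcount show MyCluster   # fail count for this specific resource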
After this sequence, ps auwx on two shows that A is indeed up and running.
However, the output of pcs status (on either one or two) is now the
following:
Cluster name: FirstCluster
Stack: corosync
Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Thu Apr 18 15:21:25 2019
Last change: Thu Apr 18 13:50:57 2019 by root via cibadmin on one
2 nodes configured
5 resources configured
Online: [ one two ]
Full list of resources:
MyCluster (ocf::myapp:myapp-script): Started two
Master/Slave Set: DrbdDataClone [DrbdData]
Masters: [ two ]
Slaves: [ one ]
DrbdFS (ocf::heartbeat:Filesystem): Started two
disk_fencing (stonith:fence_scsi): Started one
Failed Actions:
* MyCluster_monitor_30000 on two 'not running' (7): call=35,
status=complete, exitreason='',
last-rc-change='Thu Apr 18 15:21:12 2019', queued=0ms, exec=0ms
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
And the cluster seems to stay stuck in that state until I stop and start node
two explicitly.
Is this the expected behavior? What I was expecting was for Pacemaker to
restart A on either node - which it indeed does, on two itself. But pcs status
seems to think that an error occurred when restarting A, despite the fact that
it got A restarted just fine - and I know that A is running correctly, to boot.
What am I misunderstanding here?
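A related question: is the intended way to clear that Failed Actions entry a
manual cleanup, along the lines of

pcs resource cleanup MyCluster

(assuming that is what pcs resource cleanup is for), or should it go away on
its own once the resource is healthy again?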