[ClusterLabs] Failover event not reported correctly?
    JCA 
    1.41421 at gmail.com
       
    Thu Apr 18 17:51:37 EDT 2019
    
    
  
I have my CentOS two-node cluster, which some of you may already be sick
and tired of reading about:
# pcs status
Cluster name: FirstCluster
Stack: corosync
Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Thu Apr 18 13:52:38 2019
Last change: Thu Apr 18 13:50:57 2019 by root via cibadmin on one
2 nodes configured
5 resources configured
Online: [ one two ]
Full list of resources:
 MyCluster  (ocf::myapp:myapp-script):  Started two
 Master/Slave Set: DrbdDataClone [DrbdData]
     Masters: [ two ]
     Slaves: [ one ]
 DrbdFS (ocf::heartbeat:Filesystem):  Started two
 disk_fencing (stonith:fence_scsi): Started one
Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
I can stop either node, and the other will take over as expected. Here is
the thing though:
myapp-script starts, stops and monitors the actual application that I am
interested in. I'll call this application A. At the OS level, A is of
course listed when I do ps awux.
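
To give an idea of what the script does, it is shaped roughly like the
minimal OCF agent below. This is only a sketch for the purposes of this
mail - the myapp binary, its --pidfile option and the pidfile path are
stand-ins, not the real thing:

#!/bin/sh
# Rough sketch of the shape of myapp-script, not the actual agent.
# "myapp" and the pidfile path below are placeholders for application A.

: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

PIDFILE=/var/run/myapp.pid

myapp_monitor() {
    # OCF_SUCCESS if A is running, OCF_NOT_RUNNING if it is not.
    if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

myapp_start() {
    # Starting an already-running resource must succeed.
    myapp_monitor && return $OCF_SUCCESS
    /usr/local/bin/myapp --pidfile "$PIDFILE" || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

myapp_stop() {
    # Stopping an already-stopped resource must succeed.
    myapp_monitor || return $OCF_SUCCESS
    kill "$(cat "$PIDFILE")"
    return $OCF_SUCCESS
}

case "$1" in
    start)   myapp_start ;;
    stop)    myapp_stop ;;
    monitor) myapp_monitor ;;
    *)       exit $OCF_ERR_UNIMPLEMENTED ;; # real agent also handles meta-data etc.
esac
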
In the situation above, where A is running on two, I can kill A from the
CentOS command line on two. Shortly after doing so, Pacemaker invokes
myapp-script on two with the following actions, which return the following
values:
   monitor: OCF_NOT_RUNNING
   stop: OCF_SUCCESS
   start: OCF_SUCCESS
   monitor: OCF_SUCCESS
After this, ps auwx on two shows that A is indeed up and running.
However, the output of pcs status (on either one or two) is now the
following:
Cluster name: FirstCluster
Stack: corosync
Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Thu Apr 18 15:21:25 2019
Last change: Thu Apr 18 13:50:57 2019 by root via cibadmin on one
2 nodes configured
5 resources configured
Online: [ one two ]
Full list of resources:
 MyCluster  (ocf::myapp:myapp-script):  Started two
 Master/Slave Set: DrbdDataClone [DrbdData]
     Masters: [ two ]
     Slaves: [ one ]
 DrbdFS (ocf::heartbeat:Filesystem):  Started two
 disk_fencing (stonith:fence_scsi): Started one
Failed Actions:
* MyCluster_monitor_30000 on two 'not running' (7): call=35,
status=complete, exitreason='',
    last-rc-change='Thu Apr 18 15:21:12 2019', queued=0ms, exec=0ms
Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
And the cluster seems to stay stuck there, until I stop and start node two
explicitly.
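
My understanding is that the failed monitor also gets recorded against the
resource's fail count, and that the record can be inspected and cleared by
hand rather than by bouncing the node - presumably with something along
these lines, though I have not tried it:

# pcs resource failcount show MyCluster
# pcs resource cleanup MyCluster
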
       Is this the expected behavior? What I was expecting was for Pacemaker
to restart A on either node - which it indeed does, on two itself. But pcs
status seems to think that an error happened when trying to restart A, even
though it got A restarted all right, and I know that A is running correctly
to boot.
       What am I misunderstanding here?
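
(Related, in case it is part of what I am misunderstanding: my understanding
is that on a monitor failure Pacemaker restarts the resource in place by
default, and only moves it to the other node once the fail count reaches
migration-threshold. If I have that right, making it fail over to one on the
first failure would be something like:

# pcs resource meta MyCluster migration-threshold=1

but that is a side question; the lingering Failed Actions entry is what
puzzles me.)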