[ClusterLabs] Failover event not reported correctly?
JCA
1.41421 at gmail.com
Thu Apr 18 17:51:37 EDT 2019
I have my CentOS two-node cluster, which some of you may already be sick
and tired of reading about:
# pcs status
Cluster name: FirstCluster
Stack: corosync
Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Thu Apr 18 13:52:38 2019
Last change: Thu Apr 18 13:50:57 2019 by root via cibadmin on one
2 nodes configured
5 resources configured
Online: [ one two ]
Full list of resources:
MyCluster (ocf::myapp:myapp-script): Started two
Master/Slave Set: DrbdDataClone [DrbdData]
Masters: [ two ]
Slaves: [ one ]
DrbdFS (ocf::heartbeat:Filesystem): Started two
disk_fencing (stonith:fence_scsi): Started one
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
I can stop either node, and the other will take over as expected. Here is
the thing though:
myapp-script starts, stops and monitors the actual application that I am
interested in. I'll call this application A. At the OS level, A is of
course listed when I do ps awux.
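For context, myapp-script has the shape of a standard OCF agent. A minimal
sketch of that shape (the binary path, pidfile and process handling below are
placeholders, not the real thing):

#!/bin/sh
# Stripped-down OCF-style agent sketch; paths and process handling are placeholders.
# Pacemaker calls the agent with one action argument: start, stop, monitor, ...
# (a real agent also answers meta-data and validate-all)

: ${OCF_ROOT:=/usr/lib/ocf}
. ${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs   # defines OCF_SUCCESS, OCF_NOT_RUNNING, ...

APP_BIN=/usr/local/bin/A     # placeholder for the real application A
PIDFILE=/var/run/A.pid       # placeholder

monitor() {
    # OCF_SUCCESS (0) if A is running, OCF_NOT_RUNNING (7) if it is not.
    [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null && return $OCF_SUCCESS
    return $OCF_NOT_RUNNING
}

start() {
    monitor && return $OCF_SUCCESS           # already running
    "$APP_BIN" &
    echo $! > "$PIDFILE"
    return $OCF_SUCCESS
}

stop() {
    monitor || return $OCF_SUCCESS           # already stopped counts as a successful stop
    kill "$(cat "$PIDFILE")"
    rm -f "$PIDFILE"
    return $OCF_SUCCESS
}

case "$1" in
    start)   start ;;
    stop)    stop ;;
    monitor) monitor ;;
    *)       exit $OCF_ERR_UNIMPLEMENTED ;;
esac
exit $?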
In the situation above, where A is running on two, I can kill A from the
CentOS command line on two. Shortly after doing so, Pacemaker invokes
myapp-script on two with the following actions, which return the following
values:
monitor: OCF_NOT_RUNNING
stop: OCF_SUCCESS
start: OCF_SUCCESS
monitor: OCF_SUCCESS
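For reference, I assume the failed monitor gets recorded per resource and per
node, and that the record can be inspected with something like this (resource
name taken from the status output above):

crm_mon -1 -f                           # one-shot cluster status including fail counts
pcs resource failcount show MyCluster   # fail count for this specific resource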
After this sequence, ps auwx on two shows that A is indeed up and running.
However, the output of pcs status (on either one or two) is now the
following:
Cluster name: FirstCluster
Stack: corosync
Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Thu Apr 18 15:21:25 2019
Last change: Thu Apr 18 13:50:57 2019 by root via cibadmin on one
2 nodes configured
5 resources configured
Online: [ one two ]
Full list of resources:
MyCluster (ocf::myapp:myapp-script): Started two
Master/Slave Set: DrbdDataClone [DrbdData]
Masters: [ two ]
Slaves: [ one ]
DrbdFS (ocf::heartbeat:Filesystem): Started two
disk_fencing (stonith:fence_scsi): Started one
Failed Actions:
* MyCluster_monitor_30000 on two 'not running' (7): call=35,
status=complete, exitreason='',
last-rc-change='Thu Apr 18 15:21:12 2019', queued=0ms, exec=0ms
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
And the cluster seems to stay stuck in that state until I stop and start node
two explicitly.
Is this the expected behavior? What I was expecting was for Pacemaker to
restart A on either node - which it indeed does, on two itself. But pcs status
seems to think that an error occurred when restarting A, despite the fact that
it got A restarted just fine - and I know that A is running correctly, to boot.
What am I misunderstanding here?
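A related question: is the intended way to clear that Failed Actions entry a
manual cleanup, along the lines of

pcs resource cleanup MyCluster

(assuming that is what pcs resource cleanup is for), or should it go away on
its own once the resource is healthy again?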