[ClusterLabs] Failover event not reported correctly?
Ken Gaillot
kgaillot at redhat.com
Thu Apr 18 19:00:48 EDT 2019
On Thu, 2019-04-18 at 15:51 -0600, JCA wrote:
> I have my CentOS two-node cluster, which some of you may already be
> sick and tired of reading about:
>
> # pcs status
> Cluster name: FirstCluster
> Stack: corosync
> Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition
> with quorum
> Last updated: Thu Apr 18 13:52:38 2019
> Last change: Thu Apr 18 13:50:57 2019 by root via cibadmin on one
>
> 2 nodes configured
> 5 resources configured
>
> Online: [ one two ]
>
> Full list of resources:
>
> MyCluster (ocf::myapp:myapp-script): Started two
> Master/Slave Set: DrbdDataClone [DrbdData]
>     Masters: [ two ]
>     Slaves: [ one ]
> DrbdFS (ocf::heartbeat:Filesystem): Started two
> disk_fencing (stonith:fence_scsi): Started one
>
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled
>
> I can stop either node, and the other will take over as expected.
> Here is the thing though:
>
> myapp-script starts, stops and monitors the actual application that I
> am interested in. I'll call this application A. At the OS level, A is
> of course listed when I do ps awux.
>
> In the situation above, where A is running on two, I can kill A from
> the CentOS command line on two. Shortly after doing so, Pacemaker
> invokes myapp-script on two, in the following ways and with the
> following return values:
>
> monitor: OCF_NOT_RUNNING
> stop: OCF_SUCCESS
> start: OCF_SUCCESS
> monitor: OCF_SUCCESS
>
> After this, running ps auwx on two shows that A is indeed up and
> running. However, the output from pcs status (on either one or two)
> is now the following:
>
> Cluster name: FirstCluster
> Stack: corosync
> Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition
> with quorum
> Last updated: Thu Apr 18 15:21:25 2019
> Last change: Thu Apr 18 13:50:57 2019 by root via cibadmin on one
>
> 2 nodes configured
> 5 resources configured
>
> Online: [ one two ]
>
> Full list of resources:
>
> MyCluster (ocf::myapp:myapp-script): Started two
> Master/Slave Set: DrbdDataClone [DrbdData]
>     Masters: [ two ]
>     Slaves: [ one ]
> DrbdFS (ocf::heartbeat:Filesystem): Started two
> disk_fencing (stonith:fence_scsi): Started one
>
> Failed Actions:
> * MyCluster_monitor_30000 on two 'not running' (7): call=35,
> status=complete, exitreason='',
> last-rc-change='Thu Apr 18 15:21:12 2019', queued=0ms, exec=0ms
>
>
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled
>
> And the cluster seems to stay stuck there, until I stop and start
> node two explicitly.
>
> Is this the expected behavior? What I was expecting is for
> Pacemaker to restart A on either node - which it indeed does, on two
> itself. But pcs status seems to think that an error happened when
> trying to restart A - despite the fact that it got A restarted all
> right. And I know that A is running correctly, to boot.
>
> What am I misunderstanding here?

You got everything right, except the display is not saying the restart
failed -- it's saying there was a monitor failure that led to the
restart. The "Failed Actions" section is a history rather than the
current status (which is the "Full list of resources" section above it).
The idea is that failures might occur when you're not looking :) and
you can see that they happened the next time you check the status, even
if the cluster was able to recover successfully.
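For what it's worth, the "(7)" in that Failed Actions entry is just the
OCF exit code your monitor returned (OCF_NOT_RUNNING), matching the
sequence you listed. A shell agent's monitor usually boils down to
something like the sketch below; the pgrep check is only a stand-in for
however myapp-script actually detects A:

    myapp_monitor() {
        # OCF_SUCCESS=0 and OCF_NOT_RUNNING=7 (defined by ocf-shellfuncs)
        if pgrep -f myapp >/dev/null 2>&1; then
            return $OCF_SUCCESS       # A is running; nothing gets recorded
        else
            return $OCF_NOT_RUNNING   # logged in the history, recovery kicks in
        fi
    }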
To clear the history, run "crm_resource -C -r MyCluster" (or "pcs
resource cleanup MyCluster" if you're using pcs).
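For example, from either node:

    # pcs resource cleanup MyCluster

After that, the Failed Actions entry disappears from pcs status until a
new failure is recorded.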
--
Ken Gaillot <kgaillot at redhat.com>