[ClusterLabs] Failover event not reported correctly?
Ken Gaillot
kgaillot at redhat.com
Thu Apr 18 19:00:48 EDT 2019
On Thu, 2019-04-18 at 15:51 -0600, JCA wrote:
> I have my CentOS two-node cluster, which some of you may already be
> sick and tired of reading about:
>
> # pcs status
> Cluster name: FirstCluster
> Stack: corosync
> Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition
> with quorum
> Last updated: Thu Apr 18 13:52:38 2019
> Last change: Thu Apr 18 13:50:57 2019 by root via cibadmin on one
>
> 2 nodes configured
> 5 resources configured
>
> Online: [ one two ]
>
> Full list of resources:
>
> MyCluster (ocf::myapp:myapp-script): Started two
> Master/Slave Set: DrbdDataClone [DrbdData]
>     Masters: [ two ]
>     Slaves: [ one ]
> DrbdFS (ocf::heartbeat:Filesystem): Started two
> disk_fencing (stonith:fence_scsi): Started one
>
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled
>
> I can stop either node, and the other will take over as expected.
> Here is the thing though:
>
> myapp-script starts, stops and monitors the actual application that I
> am interested in. I'll call this application A. At the OS level, A is
> of course listed when I do ps awux.
>
> In the situation above, where A is running on two, I can kill A from
> the CentOS command line on two. Shortly after doing so, Pacemaker
> invokes myapp-script on two, in the following ways and with the
> following return values:
>
> monitor: OCF_NOT_RUNNING
> stop: OCF_SUCCESS
> start: OCF_SUCCESS
> monitor: OCF_SUCCESS
>
> After this, running ps auwx on two shows that A is indeed up and
> running. However, the output from pcs status (on either one or two)
> is now the following:
>
> Cluster name: FirstCluster
> Stack: corosync
> Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition
> with quorum
> Last updated: Thu Apr 18 15:21:25 2019
> Last change: Thu Apr 18 13:50:57 2019 by root via cibadmin on one
>
> 2 nodes configured
> 5 resources configured
>
> Online: [ one two ]
>
> Full list of resources:
>
> MyCluster (ocf::myapp:myapp-script): Started two
> Master/Slave Set: DrbdDataClone [DrbdData]
>     Masters: [ two ]
>     Slaves: [ one ]
> DrbdFS (ocf::heartbeat:Filesystem): Started two
> disk_fencing (stonith:fence_scsi): Started one
>
> Failed Actions:
> * MyCluster_monitor_30000 on two 'not running' (7): call=35,
> status=complete, exitreason='',
> last-rc-change='Thu Apr 18 15:21:12 2019', queued=0ms, exec=0ms
>
>
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled
>
> And the cluster seems to stay stuck there, until I stop and start
> node two explicitly.
>
> Is this the expected behavior? What I was expecting is for
> Pacemaker to restart A on either node - which it indeed does, on two
> itself. But pcs status seems to think that an error happened when
> trying to restart A - despite the fact that it got A restarted all
> right. And I know that A is running correctly, to boot.
>
> What am I misunderstanding here?

You got everything right, except the display is not saying the restart
failed -- it's saying there was a monitor failure that led to the
restart. The "Failed Actions" section is a history rather than the
current status (which is the "Full list of resources" section above it).
The idea is that failures might occur when you're not looking :) and
you can see that they happened the next time you check the status, even
if the cluster was able to recover successfully.
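For what it's worth, the "(7)" in that Failed Actions entry is just the
OCF exit code your monitor returned (OCF_NOT_RUNNING), matching the
sequence you listed. A shell agent's monitor usually boils down to
something like the sketch below; the pgrep check is only a stand-in for
however myapp-script actually detects A:

    myapp_monitor() {
        # OCF_SUCCESS=0 and OCF_NOT_RUNNING=7 (defined by ocf-shellfuncs)
        if pgrep -f myapp >/dev/null 2>&1; then
            return $OCF_SUCCESS       # A is running; nothing gets recorded
        else
            return $OCF_NOT_RUNNING   # logged in the history, recovery kicks in
        fi
    }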
To clear the history, run "crm_resource -C -r MyCluster" (or "pcs
resource cleanup MyCluster" if you're using pcs).
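For example, from either node:

    # pcs resource cleanup MyCluster

After that, the Failed Actions entry disappears from pcs status until a
new failure is recorded.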
--
Ken Gaillot <kgaillot at redhat.com>