[ClusterLabs] Failover event not reported correctly?

JCA 1.41421 at gmail.com
Thu Apr 18 19:54:53 EDT 2019


Yep, that works fine. Thanks for the explanation.

On Thu, Apr 18, 2019 at 5:00 PM Ken Gaillot <kgaillot at redhat.com> wrote:

> On Thu, 2019-04-18 at 15:51 -0600, JCA wrote:
> > I have my CentOS two-node cluster, which some of you may already be
> > sick and tired of reading about:
> >
> > # pcs status
> > Cluster name: FirstCluster
> > Stack: corosync
> > Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition
> > with quorum
> > Last updated: Thu Apr 18 13:52:38 2019
> > Last change: Thu Apr 18 13:50:57 2019 by root via cibadmin on one
> >
> > 2 nodes configured
> > 5 resources configured
> >
> > Online: [ one two ]
> >
> > Full list of resources:
> >
> >  MyCluster  (ocf::myapp:myapp-script):  Started two
> >  Master/Slave Set: DrbdDataClone [DrbdData]
> >      Masters: [ two ]
> >      Slaves: [ one ]
> >  DrbdFS (ocf::heartbeat:Filesystem):  Started two
> >  disk_fencing (stonith:fence_scsi): Started one
> >
> > Daemon Status:
> >   corosync: active/enabled
> >   pacemaker: active/enabled
> >   pcsd: active/enabled
> >
> > I can stop either node, and the other will take over as expected.
> > Here is the thing though:
> >
> > myapp-script starts, stops and monitors the actual application that I
> > am interested in. I'll call this application A. At the OS level, A is
> > of course listed when I do ps awux.
> >
> > In the situation above, where A is running on two, I can kill A from
> > the CentOS command line on two. Shortly after doing so, Pacemaker
> > invokes myapp-script on two with the following actions, which return
> > the following values (a simplified sketch of the monitor logic follows
> > the list):
> >
> >    monitor: OCF_NOT_RUNNING
> >    stop: OCF_SUCCESS
> >    start: OCF_SUCCESS
> >    monitor: OCF_SUCCESS
> >
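> > For reference, a monitor action that behaves this way usually boils down
> > to a simple "is the process there?" check. The sketch below is a
> > simplified, generic version rather than the literal script; the "myapp"
> > process name and the function name are just placeholders:
> >
> > #!/bin/sh
> > # ocf-shellfuncs provides OCF_SUCCESS (0), OCF_NOT_RUNNING (7), etc.
> > : ${OCF_FUNCTIONS_DIR:=${OCF_ROOT:-/usr/lib/ocf}/lib/heartbeat}
> > . ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
> >
> > myapp_monitor() {
> >     if pgrep -x myapp >/dev/null 2>&1; then
> >         return $OCF_SUCCESS      # A is running: monitor succeeds
> >     else
> >         return $OCF_NOT_RUNNING  # A was killed: the OCF_NOT_RUNNING above
> >     fi
> > }
> >
> > case "$1" in
> >     monitor) myapp_monitor; exit $? ;;
> >     # start, stop, meta-data etc. omitted from this sketch
> > esac
> >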
> > After this, ps auwx on two shows that A is indeed up and running.
> > However, the output of pcs status (on either one or two) is now the
> > following:
> >
> > Cluster name: FirstCluster
> > Stack: corosync
> > Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition
> > with quorum
> > Last updated: Thu Apr 18 15:21:25 2019
> > Last change: Thu Apr 18 13:50:57 2019 by root via cibadmin on one
> >
> > 2 nodes configured
> > 5 resources configured
> >
> > Online: [ one two ]
> >
> > Full list of resources:
> >
> >  MyCluster  (ocf::myapp:myapp-script):  Started two
> >  Master/Slave Set: DrbdDataClone [DrbdData]
> >      Masters: [ two ]
> >      Slaves: [ one ]
> >  DrbdFS (ocf::heartbeat:Filesystem):  Started two
> >  disk_fencing (stonith:fence_scsi): Started one
> >
> > Failed Actions:
> > * MyCluster_monitor_30000 on two 'not running' (7): call=35,
> > status=complete, exitreason='',
> >     last-rc-change='Thu Apr 18 15:21:12 2019', queued=0ms, exec=0ms
> >
> >
> > Daemon Status:
> >   corosync: active/enabled
> >   pacemaker: active/enabled
> >   pcsd: active/enabled
> >
> > And the cluster seems to stay stuck there, until I stop and start
> > node two explicitly.
> >
> >        Is this the expected behavior? What I was expecting was for
> > Pacemaker to restart A on either node - which it indeed does, on two
> > itself. But pcs status seems to think that an error happened when
> > trying to restart A - despite the fact that it got A restarted all
> > right. And I know that A is running correctly to boot.
> >
> >        What am I misunderstanding here?
>
> You got everything right, except the display is not saying the restart
> failed -- it's saying there was a monitor failure that led to the
> restart. The "failed actions" section is a history rather than the
> current status (which is the "full cluster status" section).
>
> The idea is that failures might occur when you're not looking :) and
> you can see that they happened the next time you check the status, even
> if the cluster was able to recover successfully.
>
> To clear the history, run "crm_resource -C -r MyCluster" (or "pcs
> resource cleanup MyCluster" if you're using pcs).
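>
> In command form, roughly (the --node variant just limits the cleanup to
> the history recorded against one node):
>
> # Clear the recorded monitor failure so it no longer shows under
> # "Failed Actions" in the status output:
> crm_resource --cleanup --resource MyCluster
>
> # Equivalent with pcs:
> pcs resource cleanup MyCluster
>
> # Restrict the cleanup to a single node:
> crm_resource --cleanup --resource MyCluster --node two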
> --
> Ken Gaillot <kgaillot at redhat.com>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>