[ClusterLabs] Pacemaker stopped monitoring the resource

Abhay B abhayyb at gmail.com
Fri Sep 1 05:36:19 EDT 2017


>
> Are you sure the monitor stopped? Pacemaker only logs recurring monitors
> when the status changes. Any successful monitors after this wouldn't be
> logged.


Yes. There were no logs saying "RecurringOp:  Start recurring monitor" on
the node after it had failed, and no logs for any actions pertaining to
the resource on that node.
The problem was that even though the one node kept failing, the resources
were never moved to the other node (the node on which I suspect monitoring
had stopped).

> There are a lot of resource action failures, so I'm not sure where the
> issue is, but I'm guessing it has to do with migration-threshold=1 --
> once a resource has failed once on a node, it won't be allowed back on
> that node until the failure is cleaned up. Of course you also have
> failure-timeout=1s, which should clean it up immediately, so I'm not
> sure.


migration-threshold=1
failure-timeout=1s
cluster-recheck-interval=2s
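
For reference, this is roughly how those values are applied (a sketch
using pcs; migration-threshold and failure-timeout are meta-attributes,
shown here on the SVSDEHA primitive, while cluster-recheck-interval is a
cluster property; the exact commands I originally ran may have differed):

# pcs property set cluster-recheck-interval=2s
# pcs resource update SVSDEHA meta migration-threshold=1 failure-timeout=1s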

> first, set "two_node:
> 1" in corosync.conf and let no-quorum-policy default in pacemaker


This is already configured.
# cat /etc/corosync/corosync.conf
totem {
    version: 2
    secauth: off
    cluster_name: SVSDEHA
    transport: udpu
    token: 5000
}

nodelist {
    node {
        ring0_addr: 2.0.0.10
        nodeid: 1
    }

    node {
        ring0_addr: 2.0.0.11
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}
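
As a sanity check, the setting can also be confirmed at runtime (a
sketch; corosync-quorumtool ships with corosync on CentOS 7):

# corosync-quorumtool -s

With two_node: 1 active, the Flags line should include 2Node and
WaitForAll.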

> let no-quorum-policy default in pacemaker; then,
> get stonith configured, tested, and enabled


If no-quorum-policy is not configured, would the cluster still ignore
quorum for a 2-node cluster?
For my use case I don't need stonith enabled; my intention is to have a
highly available system at all times.
I will test my RA again as suggested, with no-quorum-policy left at its
default (see the sketch below).
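
A minimal sketch of what I plan to try, assuming the stock pcs and
ocf-tester packages on CentOS 7 (the agent path below is illustrative;
my RA actually lives under a custom provider directory):

# pcs property unset no-quorum-policy
# ocf-tester -n SVSDEHA-test /usr/lib/ocf/resource.d/heartbeat/SVSDEHA

ocf-tester runs the agent through the standard OCF actions and reports
any return codes that don't conform to the spec.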

One more question.
Why do we see this in 'pcs property'?
last-lrm-refresh: 1504090367

Never seen this on a healthy cluster.
From RHEL documentation:
last-lrm-refresh
Last refresh of the Local Resource Manager, given in units of seconds since
epoch. Used for diagnostic purposes; not user-configurable.

Doesn't explain much.
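
For what it's worth, the value is just seconds since the epoch, so it
decodes to an ordinary timestamp (a sketch with GNU date; output in UTC):

# date -u -d @1504090367
Wed Aug 30 10:52:47 UTC 2017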

Also, does average CPU load impact resource monitoring?

Regards,
Abhay


On Thu, 31 Aug 2017 at 20:11 Ken Gaillot <kgaillot at redhat.com> wrote:

> On Thu, 2017-08-31 at 06:41 +0000, Abhay B wrote:
> > Hi,
> >
> >
> > I have a 2 node HA cluster configured on CentOS 7 with pcs command.
> >
> >
> > Below are the properties of the cluster :
> >
> >
> > # pcs property
> > Cluster Properties:
> >  cluster-infrastructure: corosync
> >  cluster-name: SVSDEHA
> >  cluster-recheck-interval: 2s
> >  dc-deadtime: 5
> >  dc-version: 1.1.15-11.el7_3.5-e174ec8
> >  have-watchdog: false
> >  last-lrm-refresh: 1504090367
> >  no-quorum-policy: ignore
> >  start-failure-is-fatal: false
> >  stonith-enabled: false
> >
> >
> > PFA the cib.
> > Also attached is the corosync.log around the time the below issue
> > happened.
> >
> >
> > After around 10 hrs and multiple failures, pacemaker stops monitoring
> > resource on one of the nodes in the cluster.
> >
> >
> > So even though the resource on other node fails, it is never migrated
> > to the node on which the resource is not monitored.
> >
> >
> > Wanted to know what could have triggered this and how to avoid getting
> > into such scenarios.
> > I am going through the logs and couldn't find why this happened.
> >
> >
> > After this log the monitoring stopped.
> >
> > Aug 29 11:01:44 [16500] TPC-D12-10-002.phaedrus.sandvine.com
> > crmd:     info: process_lrm_event:   Result of monitor operation for
> > SVSDEHA on TPC-D12-10-002.phaedrus.sandvine.com: 0 (ok) | call=538
> > key=SVSDEHA_monitor_2000 confirmed=false cib-update=50013
>
> Are you sure the monitor stopped? Pacemaker only logs recurring monitors
> when the status changes. Any successful monitors after this wouldn't be
> logged.
>
> > Below log says the resource is leaving the cluster.
> > Aug 29 11:01:44 [16499] TPC-D12-10-002.phaedrus.sandvine.com
> > pengine:     info: LogActions:  Leave   SVSDEHA:0       (Slave
> > TPC-D12-10-002.phaedrus.sandvine.com)
>
> This means that the cluster will leave the resource where it is (i.e. it
> doesn't need a start, stop, move, demote, promote, etc.).
>
> > Let me know if anything more is needed.
> >
> >
> > Regards,
> > Abhay
> >
> >
> > PS:'pcs resource cleanup' brought the cluster back into good state.
>
> There are a lot of resource action failures, so I'm not sure where the
> issue is, but I'm guessing it has to do with migration-threshold=1 --
> once a resource has failed once on a node, it won't be allowed back on
> that node until the failure is cleaned up. Of course you also have
> failure-timeout=1s, which should clean it up immediately, so I'm not
> sure.
>
> My gut feeling is that you're trying to do too many things at once. I'd
> start over from scratch and proceed more slowly: first, set "two_node:
> 1" in corosync.conf and let no-quorum-policy default in pacemaker; then,
> get stonith configured, tested, and enabled; then, test your resource
> agent manually on the command line to make sure it conforms to the
> expected return values
> ( http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-ocf );
> then add your resource to
> the cluster without migration-threshold or failure-timeout, and work out
> any issues with frequent failures; then finally set migration-threshold and
> failure-timeout to reflect how you want recovery to proceed.
> --
> Ken Gaillot <kgaillot at redhat.com>
>
>
>
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>