[ClusterLabs] Pacemaker stopped monitoring the resource

Thu Aug 31 10:40:59 EDT 2017

On Thu, 2017-08-31 at 06:41 +0000, Abhay B wrote:
> Hi, 
> 
> 
> I have a 2-node HA cluster configured on CentOS 7 with the pcs command.
> 
> 
> Below are the properties of the cluster :
> 
> 
> # pcs property
> Cluster Properties:
>  cluster-infrastructure: corosync
>  cluster-name: SVSDEHA
>  cluster-recheck-interval: 2s
>  dc-deadtime: 5
>  dc-version: 1.1.15-11.el7_3.5-e174ec8
>  have-watchdog: false
>  last-lrm-refresh: 1504090367
>  no-quorum-policy: ignore
>  start-failure-is-fatal: false
>  stonith-enabled: false
> 
> 
> Please find the CIB attached.
> Also attached is the corosync.log from around the time the issue below
> happened.
> 
> 
> After around 10 hrs and multiple failures, Pacemaker stops monitoring
> the resource on one of the nodes in the cluster.
> 
> 
> So even though the resource on the other node fails, it is never
> migrated to the node on which the resource is not monitored.
> 
> 
> I would like to know what could have triggered this and how to avoid
> getting into such scenarios.
> I have been going through the logs but couldn't find why this happened.
> 
> 
> After this log the monitoring stopped.
> 
> Aug 29 11:01:44 [16500] TPC-D12-10-002.phaedrus.sandvine.com
> crmd:     info: process_lrm_event:   Result of monitor operation for
> SVSDEHA on TPC-D12-10-002.phaedrus.sandvine.com: 0 (ok) | call=538
> key=SVSDEHA_monitor_2000 confirmed=false cib-update=50013

Are you sure the monitor stopped? Pacemaker only logs recurring monitors
when the status changes. Any successful monitors after this wouldn't be
logged.
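
Since only result changes are logged, it can help to pull out the logged
results and look at the last recorded state instead of counting log
lines. The following is a minimal Python sketch, not a Pacemaker tool: the
regular expression is inferred only from the single log line quoted
above, and parse_result is a hypothetical helper name. It assumes the
wrapped log line has been joined back into one line.

```python
import re

# Pattern inferred from the crmd "Result of ... operation" line quoted above.
# Other Pacemaker versions may format this line differently.
RESULT_RE = re.compile(
    r"Result of (?P<op>\S+) operation for (?P<rsc>\S+) on (?P<node>\S+): "
    r"(?P<rc>\d+) \((?P<status>[^)]*)\) \| call=(?P<call>\d+) key=(?P<key>\S+)"
)


def parse_result(line):
    """Return the fields of a logged operation result, or None if no match."""
    match = RESULT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["rc"] = int(fields["rc"])
    fields["call"] = int(fields["call"])
    return fields


# The log line from the original message, joined into a single line.
line = (
    "Aug 29 11:01:44 [16500] TPC-D12-10-002.phaedrus.sandvine.com "
    "crmd:     info: process_lrm_event:   Result of monitor operation for "
    "SVSDEHA on TPC-D12-10-002.phaedrus.sandvine.com: 0 (ok) | call=538 "
    "key=SVSDEHA_monitor_2000 confirmed=false cib-update=50013"
)
result = parse_result(line)
```

If the last logged result for a recurring monitor is 0 (ok), the absence
of later lines is consistent with the monitor still succeeding.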

> The log below says the resource is leaving the cluster.
> Aug 29 11:01:44 [16499] TPC-D12-10-002.phaedrus.sandvine.com
> pengine:     info: LogActions:  Leave   SVSDEHA:0       (Slave
> TPC-D12-10-002.phaedrus.sandvine.com)

This means that the cluster will leave the resource where it is (i.e. it
doesn't need a start, stop, move, demote, promote, etc.).

> Let me know if anything more is needed. 
> 
> 
> Regards,
> Abhay
> 
> 
> PS: 'pcs resource cleanup' brought the cluster back into a good state.

There are a lot of resource action failures, so I can't pinpoint the
issue, but my guess is that it has to do with migration-threshold=1:
once a resource has failed once on a node, it won't be allowed back on
that node until the failure is cleaned up. Of course you also have
failure-timeout=1s, which should clean it up immediately, so that may
not be the whole explanation.
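
The interaction between migration-threshold and failure-timeout can be
sketched with a toy model in Python. This is an illustration under
simplifying assumptions, not Pacemaker's actual policy engine logic: the
function name node_allowed and its arguments are hypothetical, failures
are assumed to expire together once the most recent one is older than
failure-timeout, and the fact that the real cluster only re-evaluates
expiry when it next rechecks is ignored.

```python
def node_allowed(failure_times, now, migration_threshold, failure_timeout=None):
    """Toy model: may the resource run on this node at time `now`?

    failure_times       -- timestamps (seconds) of failures on this node
    migration_threshold -- failures after which the node is banned
    failure_timeout     -- seconds after the last failure before the
                           failures are forgotten (None = never)
    """
    if failure_timeout is not None and failure_times:
        # Assumption: expiry is measured from the most recent failure.
        if now - max(failure_times) >= failure_timeout:
            failure_times = []
    return len(failure_times) < migration_threshold


# One failure at t=100 with migration-threshold=1 and failure-timeout=1s:
banned_right_after = not node_allowed([100.0], 100.5, 1, 1.0)
allowed_after_expiry = node_allowed([100.0], 101.5, 1, 1.0)
# Without failure-timeout, the ban lasts until a manual cleanup.
banned_without_timeout = not node_allowed([100.0], 5000.0, 1)
```

In this model a single failure bans the node, and the ban lifts either
when the failure expires or when it is cleaned up by hand, which matches
the observation that 'pcs resource cleanup' restored the cluster.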

My gut feeling is that you're trying to do too many things at once. I'd
start over from scratch and proceed more slowly: first, set "two_node:
1" in corosync.conf and let no-quorum-policy default in pacemaker; then,
get stonith configured, tested, and enabled; then, test your resource
agent manually on the command line to make sure it conforms to the
expected return values
( http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-ocf );
then add your resource to the cluster without migration-threshold or
failure-timeout, and work out any issues with frequent failures; then
finally set migration-threshold and failure-timeout to reflect how you
want recovery to proceed.
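
For the manual resource agent test, a small Python wrapper along these
lines could run an agent action outside the cluster and name the exit
code. It is a sketch under assumptions: describe, run_action, and report
are hypothetical helper names, the default OCF_ROOT of /usr/lib/ocf and
the OCF_RESKEY_ prefix for parameters follow the usual OCF conventions,
and the agent path must be supplied by the caller. The numeric codes are
the standard OCF return codes described in the appendix linked above.

```python
import os
import subprocess

# Exit codes defined by the OCF resource agent API.
OCF_RC = {
    0: "OCF_SUCCESS",
    1: "OCF_ERR_GENERIC",
    2: "OCF_ERR_ARGS",
    3: "OCF_ERR_UNIMPLEMENTED",
    4: "OCF_ERR_PERM",
    5: "OCF_ERR_INSTALLED",
    6: "OCF_ERR_CONFIGURED",
    7: "OCF_NOT_RUNNING",
    8: "OCF_RUNNING_MASTER",
    9: "OCF_FAILED_MASTER",
}


def describe(rc):
    """Map an exit code to its OCF name, flagging codes outside the API."""
    return OCF_RC.get(rc, "not an OCF return code")


def run_action(agent, action, params=None, ocf_root="/usr/lib/ocf"):
    """Run one agent action (e.g. "monitor") and return its exit code."""
    env = dict(os.environ, OCF_ROOT=ocf_root)
    for name, value in (params or {}).items():
        # Resource parameters are passed to agents as OCF_RESKEY_<name>.
        env["OCF_RESKEY_" + name] = str(value)
    return subprocess.run([agent, action], env=env).returncode


def report(agent, actions=("monitor",), params=None):
    """Print the exit code and OCF name for each requested action."""
    for action in actions:
        rc = run_action(agent, action, params)
        print(f"{action}: {rc} ({describe(rc)})")
```

Calling report() with the path to the agent and a list such as
["start", "monitor", "stop", "monitor"] gives a quick view of whether a
stopped resource reports 7 (OCF_NOT_RUNNING) and a running one reports 0,
which is what the cluster relies on when interpreting monitor results.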
-- 
Ken Gaillot <kgaillot at redhat.com>