[ClusterLabs] Monitor being called repeatedly for Master/Slave resource despite monitor returning failure

Ken Gaillot kgaillot at redhat.com
Mon Feb 19 12:36:22 EST 2018


On Mon, 2018-02-19 at 16:48 +0530, Pankaj wrote:
> Hi,
> 
> 
> I have configured a wildfly resource in master/slave mode on a 6-VM
> cluster with stonith disabled and no-quorum-policy set to ignore.

To some of us that sounds like "I'm driving a car with no brakes ..."
:-)

Without stonith or quorum, there's a high risk of split-brain. Any node
that gets cut off from the others will start all the resources.
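If you do later configure a fence device, turning the safeguards back
on is just a couple of property changes. A sketch in crmsh syntax
(which your configuration appears to use):

    # only enable fencing once a working stonith device is configured
    crm configure property stonith-enabled=true
    crm configure property no-quorum-policy=stop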

> We are observing that on either of master or slave resource failure,
> pacemaker keeps on calling stateful_monitor for wildfly repeatedly,
> despite us returning appropriate failure return codes on monitor
> failure for both master (failure rc=OCF_MASTER_FAILED) and slave
> (failure rc=OCF_NOT_RUNNING).

With your configuration, after the first monitor failure the cluster
should stop the resource, start it again, and then resume monitoring
it.
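You can watch that recovery cycle, including the fail count as it
accumulates toward migration-threshold, with crm_mon. For example, a
one-shot status with fail counts shown:

    crm_mon -1 -f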

At any given time, one of the nodes is elected the DC (Designated
Controller). That node runs the policy engine to decide what needs to
be done, so the logs from that node will be the most helpful.
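Any node can tell you which one is currently the DC, for example:

    # prints the name of the current DC
    crmadmin -D

    # or look for the "Current DC:" line in the status output
    crm_mon -1 | grep "Current DC"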

Look for the time the failure occurred; once the cluster detects the
failure, there should be a bunch of lines from "pengine" ending in
"Calculated transition" -- these will show what actions were decided.

After that, there will be lines from "crmd" showing "Initiating" and
"Result of" those actions.

> This continues until failure-timeout is reached, after which the
> resource gets demoted and stopped in the case of a master monitor
> failure, and stopped in the case of a slave monitor failure.
> 
> Could you please help me understand:
> Why doesn't Pacemaker demote or stop the resource immediately after
> the first failure, instead of repeatedly calling monitor?
> 
> # pacemakerd --version
> Pacemaker 1.1.16
> Written by Andrew Beekhof
> 
> # corosync -v
> Corosync Cluster Engine, version '2.4.2'
> Copyright (c) 2006-2009 Red Hat, Inc.
> 
> Below is my configuration:
> 
> node 1: VM-0
> node 2: VM-1
> node 3: VM-2
> node 4: VM-3
> node 5: VM-4
> node 6: VM-5
> primitive stateful_wildfly ocf:pacemaker:wildfly \
>         op start timeout=200s interval=0 \
>         op promote timeout=300s interval=0 \
>         op monitor interval=90s role=Master timeout=90s \
>         op monitor interval=80s role=Slave timeout=100s \
>         meta resource-stickiness=100 migration-threshold=3 failure-timeout=240s
> ms wildfly_MS stateful_wildfly \
> location stateful_wildfly_rule_2 wildfly_MS \
>         rule -inf: #uname eq VM-2
> location stateful_wildfly_rule_3 wildfly_MS \
>         rule -inf: #uname eq VM-3
> location stateful_wildfly_rule_4 wildfly_MS \
>         rule -inf: #uname eq VM-4
> location stateful_wildfly_rule_5 wildfly_MS \
>         rule -inf: #uname eq VM-5
> property cib-bootstrap-options: \
>         stonith-enabled=false \
>         no-quorum-policy=ignore \
>         cluster-recheck-interval=30s \
>         start-failure-is-fatal=false \
>         stop-all-resources=false \
>         have-watchdog=false \
>         dc-version=1.1.16-94ff4df51a \
>         cluster-infrastructure=corosync \
>         cluster-name=hacluster-0
> 
> Could you please help us understand this behavior and how to fix it?
> 
> Regards,
> Pankaj
-- 
Ken Gaillot <kgaillot at redhat.com>


