[ClusterLabs] Continuous master monitor failure of a resource in case some other resource is being promoted

Ken Gaillot kgaillot at redhat.com
Mon Feb 25 15:13:49 EST 2019


On Mon, 2019-02-25 at 14:20 +0530, Samarth Jain wrote:
> Hi,
> 
> 
> We have a bunch of resources running in master slave configuration
> with one master and one slave instance running at any given time.
> 
> What we observe is that, for any two given resources, if resource
> Stateful_Test_1 is in the middle of a promote that takes a significant
> amount of time to complete (close to 150 seconds in our scenario,
> e.g. starting a web server), and during that time resource
> Stateful_Test_2's master instance fails, then the failure of the
> Stateful_Test_2 master is never honored by the pengine, and the
> recurring monitor keeps failing without any action being taken by
> the DC.
> 
> We see the logs below for the failure of Stateful_Test_2 on the DC,
> which was VM-3 at the time:
> 
> Feb 25 11:28:13 [6013] VM-3       crmd:   notice:
> abort_transition_graph:      Transition aborted by operation
> Stateful_Test_2_monitor_17000 'create' on VM-1: Old event |
> magic=0:9;329:8:8:4a2b407e-ad15-43d0-8248-e70f9f22436b cib=0.191.5
> source=process_graph_event:498 complete=false
> 
> As per our current testing, the Stateful_Test_2 resource has failed
> 590 times and still continues to fail, without the failure being
> processed by Pacemaker. We have to intervene manually to recover it
> with a resource restart.
> 
> Could you please help me understand:
> 1. Why doesn't Pacemaker process the failure of the Stateful_Test_2
> resource immediately after the first failure?

All actions that have already been initiated must complete before the
cluster can react to new conditions. The outcome of those actions can
(and likely will) affect what needs to be done, so the cluster has to
wait for them. The action timeouts are the only real way to affect
this.

We've discussed the theoretical possibility of figuring out what would
have to be done regardless of the outcome of the in-flight actions, but
that might be computationally impractical.
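
For example, if you know a healthy promote of Stateful_Test_1 never
takes much more than 150 seconds, you could bring its promote timeout
down from 300s toward that figure, so that a hung promote is given up
on sooner and caps how long other failures can be left waiting. A
rough crmsh sketch (the 180s value is purely illustrative and must
still exceed your real worst-case promote time):

  # open the primitive for editing and reduce the promote timeout, e.g.
  #   op promote timeout=300s interval=0  ->  op promote timeout=180s interval=0
  crm configure edit Stateful_Test_1

Note this only bounds the worst case: if the promote legitimately takes
150 seconds and succeeds, the cluster still waits those 150 seconds
before acting on anything else.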

> 2. Why does the monitor failure of Stateful_Test_2 continue even
> after the promote of Stateful_Test_1 has been completed? Shouldn't it
> handle Stateful_Test_2's failure and take the necessary action on it?
> It feels as if that particular failure 'event' has been 'dropped' and
> the pengine is not even aware of Stateful_Test_2's failure.

This is a serious problem. Can you open a bug report at
bugs.clusterlabs.org, and attach (or email me privately) the output of
crm_report for the time of interest?
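
For reference, an invocation along these lines should capture the
window from your sample (the times and destination path are
placeholders; adjust them to the actual incident):

  crm_report --from "2019-02-25 11:00:00" --to "2019-02-25 12:00:00" /tmp/stateful_test_2_report

and attach the resulting tarball to the bug.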

> It's pretty straightforward to reproduce this issue.
> I have attached the two dummy resource agents we used to simulate
> our scenario, along with the commands used to configure the resources
> and ban them on the other VMs in the cluster.
> 
> Note: We have intentionally kept the monitor intervals low, contrary
> to the usual suggestions, since we want failure detection to be
> faster, as these resources are critical to our component.
> 
> Once the resources are configured, you need to perform the following
> two steps to reproduce the problem:
> 1. crm resource restart Stateful_Test_1
> 2. In another session, on whichever node Stateful_Test_2 is running
> as master, delete the marker file that is checked by Stateful_Test_2's
> master monitor.
> In our case that was VM-1, so I deleted the marker from there.
> [root@VM-1 ~]# rm -f /root/stateful_Test_2_Marker
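> 
> While reproducing, you can also watch the fail count climb with no
> recovery action being taken by showing fail counts in crm_mon from
> any node, for example:
> # crm_mon -1f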
> 
> Now, if you check the logs in /root, you will see Stateful_Test_2
> printing failure logs back to back. Here is a sample from our current
> Stateful_Test_2.log file:
> # cat /root/Stateful_Test_2.log
> Mon Feb 25 11:00:29 IST 2019 Inside promote for Stateful_Test_2!
> Mon Feb 25 11:00:34 IST 2019 Promote for Stateful_Test_2 completed!
> Mon Feb 25 11:28:13 IST 2019 Master monitor failed for
> Stateful_Test_2. Returning 9
> Mon Feb 25 11:28:30 IST 2019 Master monitor failed for
> Stateful_Test_2. Returning 9
> Mon Feb 25 11:28:47 IST 2019 Master monitor failed for
> Stateful_Test_2. Returning 9
> Mon Feb 25 11:29:04 IST 2019 Master monitor failed for
> Stateful_Test_2. Returning 9
> Mon Feb 25 11:29:21 IST 2019 Master monitor failed for
> Stateful_Test_2. Returning 9
> Mon Feb 25 11:29:38 IST 2019 Master monitor failed for
> Stateful_Test_2. Returning 9
> Mon Feb 25 11:29:55 IST 2019 Master monitor failed for
> Stateful_Test_2. Returning 9
> Mon Feb 25 11:30:12 IST 2019 Master monitor failed for
> Stateful_Test_2. Returning 9
> Mon Feb 25 11:30:29 IST 2019 Master monitor failed for
> Stateful_Test_2. Returning 9
> Mon Feb 25 11:30:46 IST 2019 Master monitor failed for
> Stateful_Test_2. Returning 9
> .
> .
> .
> Mon Feb 25 14:15:08 IST 2019 Master monitor failed for
> Stateful_Test_2. Returning 9
> Mon Feb 25 14:15:25 IST 2019 Master monitor failed for
> Stateful_Test_2. Returning 9
> Mon Feb 25 14:15:42 IST 2019 Master monitor failed for
> Stateful_Test_2. Returning 9
> Mon Feb 25 14:15:59 IST 2019 Master monitor failed for
> Stateful_Test_2. Returning 9
> 
> # pacemakerd --version
> Pacemaker 1.1.18
> Written by Andrew Beekhof
>         
> # corosync -v
> Corosync Cluster Engine, version '2.4.2'
> Copyright (c) 2006-2009 Red Hat, Inc.
>         
> Below is my cluster configuration:
>         
> node 1: VM-0
> node 2: VM-1
> node 3: VM-2
> node 4: VM-3
> node 5: VM-4
> node 6: VM-5
> primitive Stateful_Test_1 ocf:pacemaker:Stateful_Test_1 \
>         op start timeout=200s interval=0 \
>         op promote timeout=300s interval=0 \
>         op monitor interval=15s role=Master timeout=30s \
>         op monitor interval=20s role=Slave timeout=30s \
>         op stop on-fail=restart interval=0 \
>         meta resource-stickiness=100 migration-threshold=1 failure-timeout=15s
> primitive Stateful_Test_2 ocf:pacemaker:Stateful_Test_2 \
>         op start timeout=200s interval=0 \
>         op promote timeout=300s interval=0 \
>         op monitor interval=17s role=Master timeout=30s \
>         op monitor interval=25s role=Slave timeout=30s \
>         op stop on-fail=restart interval=0 \
>         meta resource-stickiness=100 migration-threshold=1 failure-timeout=15s
> ms StatefulTest1_MS Stateful_Test_1 \
>         meta resource-stickiness=100 notify=true master-max=1 interleave=true target-role=Started
> ms StatefulTest2_MS Stateful_Test_2 \
>         meta resource-stickiness=100 notify=true master-max=1 interleave=true target-role=Started
> location Stateful_Test_1_rule_2 StatefulTest1_MS \
>         rule -inf: #uname eq VM-2
> location Stateful_Test_1_rule_3 StatefulTest1_MS \
>         rule -inf: #uname eq VM-3
> location Stateful_Test_1_rule_4 StatefulTest1_MS \
>         rule -inf: #uname eq VM-4
> location Stateful_Test_1_rule_5 StatefulTest1_MS \
>         rule -inf: #uname eq VM-5
> location Stateful_Test_2_rule_2 StatefulTest2_MS \
>         rule -inf: #uname eq VM-2
> location Stateful_Test_2_rule_3 StatefulTest2_MS \
>         rule -inf: #uname eq VM-3
> location Stateful_Test_2_rule_4 StatefulTest2_MS \
>         rule -inf: #uname eq VM-4
> location Stateful_Test_2_rule_5 StatefulTest2_MS \
>         rule -inf: #uname eq VM-5
> property cib-bootstrap-options: \
>         stonith-enabled=false \
>         no-quorum-policy=ignore \
>         cluster-recheck-interval=30s \
>         start-failure-is-fatal=false \
>         stop-all-resources=false \
>         have-watchdog=false \
>         dc-version=1.1.16-94ff4df51a \
>         cluster-infrastructure=corosync \
>         cluster-name=hacluster-0
> 
> Since the resource failure is never processed, it's a serious problem
> for us, as it requires manual intervention to restart that resource.
> 
> Could you please help us understand this behavior and how to fix it?
> 
> 
> Best Regards,
> Samarth J
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: 
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



