[ClusterLabs] Continuous master monitor failure of a resource in case some other resource is being promoted

Samarth Jain samarthj2992 at gmail.com
Mon Feb 25 03:50:48 EST 2019


Hi,


We have a number of resources running in a master/slave configuration, with
one master and one slave instance running at any given time.

What we observe is that, for any two given resources, if resource
Stateful_Test_1 is in the middle of a promote that takes a significant amount
of time to complete (close to 150 seconds in our scenario, e.g. starting a
web server), and during this time the master instance of resource
Stateful_Test_2 fails, then the failure of the Stateful_Test_2 master is
never honored by the pengine; the recurring monitor keeps failing without any
action being taken by the DC.

We see the following log for the failure of Stateful_Test_2 on the DC, which
was VM-3 at that time:

Feb 25 11:28:13 [6013] VM-3       crmd:   notice: abort_transition_graph:
    Transition aborted by operation Stateful_Test_2_monitor_17000 'create'
on VM-1: Old event | magic=0:9;329:8:8:4a2b407e-ad15-43d0-8248-e70f9f22436b
cib=0.191.5 source=process_graph_event:498 complete=false

In our current test run, the Stateful_Test_2 resource has already failed 590
times and continues to fail, without the failure ever being processed by
Pacemaker. We have to intervene manually and restart the resource to recover
it.
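
For reference, the manual recovery is essentially a plain resource restart
from the crm shell (a rough sketch of our workaround, not a fix; a cleanup
can be run first in case any failures did get recorded):

# crm resource cleanup Stateful_Test_2
# crm resource restart Stateful_Test_2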

Could you please help me understand:
1. Why doesn't Pacemaker process the failure of the Stateful_Test_2 resource
immediately after the first failure?
2. Why do the monitor failures of Stateful_Test_2 continue even after the
promote of Stateful_Test_1 has completed? Shouldn't Pacemaker handle
Stateful_Test_2's failure at that point and take the necessary action? It
feels as if that particular failure event has been dropped and the pengine is
not even aware of Stateful_Test_2's failure.

It's pretty straightforward to reproduce this issue.
I have attached the two dummy resource agents we used to simulate our
scenario, along with the commands used to configure the resources and ban
them on the other VMs in the cluster.
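
In case the attachments do not come through: the relevant part of the dummy
agents is the master-role monitor, which simply checks for a marker file. The
sketch below is illustrative only (the function name is made up and the
attached scripts differ in detail); the return codes come from the standard
OCF shell functions:

. "${OCF_ROOT:-/usr/lib/ocf}/lib/heartbeat/ocf-shellfuncs"  # defines OCF_RUNNING_MASTER, OCF_FAILED_MASTER

stateful_test_2_monitor_master() {
    # Healthy master only while the marker file exists; deleting the
    # marker simulates a failure of the master instance.
    if [ -f /root/stateful_Test_2_Marker ]; then
        return "$OCF_RUNNING_MASTER"    # 8
    fi
    echo "$(date) Master monitor failed for Stateful_Test_2. Returning 9" >> /root/Stateful_Test_2.log
    return "$OCF_FAILED_MASTER"         # 9
}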

Note: We have intentionally kept the monitor intervals low, contrary to the
usual suggestions, since we want failure detection to be fast; these
resources are critical to our component.

Once the resources are configured, you need to issue the following two
commands to reproduce the problem:
1. crm resource restart Stateful_Test_1
2. In another session, on whichever node is running Stateful_Test_2 as
master, delete the marker file that is checked by Stateful_Test_2's master
monitor. In our case that was VM-1, so I deleted the marker from there:
[root@VM-1 ~]
# rm -f /root/stateful_Test_2_Marker
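
To check from the cluster side whether the failure is being recorded at all,
the standard Pacemaker tools can be used (resource and node names as in our
setup):

# crm_mon -1 -f
# crm_failcount -G -r Stateful_Test_2 -N VM-1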

Now, if you check the logs in /root, you will see Stateful_Test_2 printing
failure logs back to back. Here is a sample from our current
Stateful_Test_2.log file:
# cat /root/Stateful_Test_2.log
Mon Feb 25 11:00:29 IST 2019 Inside promote for Stateful_Test_2!
Mon Feb 25 11:00:34 IST 2019 Promote for Stateful_Test_2 completed!
Mon Feb 25 11:28:13 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9
Mon Feb 25 11:28:30 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9
Mon Feb 25 11:28:47 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9
Mon Feb 25 11:29:04 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9
Mon Feb 25 11:29:21 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9
Mon Feb 25 11:29:38 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9
Mon Feb 25 11:29:55 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9
Mon Feb 25 11:30:12 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9
Mon Feb 25 11:30:29 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9
Mon Feb 25 11:30:46 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9
.
.
.
Mon Feb 25 14:15:08 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9
Mon Feb 25 14:15:25 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9
Mon Feb 25 14:15:42 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9
Mon Feb 25 14:15:59 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9

# pacemakerd --version
Pacemaker 1.1.18
Written by Andrew Beekhof

# corosync -v
Corosync Cluster Engine, version '2.4.2'
Copyright (c) 2006-2009 Red Hat, Inc.

Below is my cluster configuration:

node 1: VM-0
node 2: VM-1
node 3: VM-2
node 4: VM-3
node 5: VM-4
node 6: VM-5
primitive Stateful_Test_1 ocf:pacemaker:Stateful_Test_1 \
        op start timeout=200s interval=0 \
        op promote timeout=300s interval=0 \
        op monitor interval=15s role=Master timeout=30s \
        op monitor interval=20s role=Slave timeout=30s \
        op stop on-fail=restart interval=0 \
        meta resource-stickiness=100 migration-threshold=1 failure-timeout=15s
primitive Stateful_Test_2 ocf:pacemaker:Stateful_Test_2 \
        op start timeout=200s interval=0 \
        op promote timeout=300s interval=0 \
        op monitor interval=17s role=Master timeout=30s \
        op monitor interval=25s role=Slave timeout=30s \
        op stop on-fail=restart interval=0 \
        meta resource-stickiness=100 migration-threshold=1 failure-timeout=15s
ms StatefulTest1_MS Stateful_Test_1 \
        meta resource-stickiness=100 notify=true master-max=1 interleave=true target-role=Started
ms StatefulTest2_MS Stateful_Test_2 \
        meta resource-stickiness=100 notify=true master-max=1 interleave=true target-role=Started
location Stateful_Test_1_rule_2 StatefulTest1_MS \
        rule -inf: #uname eq VM-2
location Stateful_Test_1_rule_3 StatefulTest1_MS \
        rule -inf: #uname eq VM-3
location Stateful_Test_1_rule_4 StatefulTest1_MS \
        rule -inf: #uname eq VM-4
location Stateful_Test_1_rule_5 StatefulTest1_MS \
        rule -inf: #uname eq VM-5
location Stateful_Test_2_rule_2 StatefulTest2_MS \
        rule -inf: #uname eq VM-2
location Stateful_Test_2_rule_3 StatefulTest2_MS \
        rule -inf: #uname eq VM-3
location Stateful_Test_2_rule_4 StatefulTest2_MS \
        rule -inf: #uname eq VM-4
location Stateful_Test_2_rule_5 StatefulTest2_MS \
        rule -inf: #uname eq VM-5
property cib-bootstrap-options: \
        stonith-enabled=false \
        no-quorum-policy=ignore \
        cluster-recheck-interval=30s \
        start-failure-is-fatal=false \
        stop-all-resources=false \
        have-watchdog=false \
        dc-version=1.1.16-94ff4df51a \
        cluster-infrastructure=corosync \
        cluster-name=hacluster-0

Since the resource failure is never processed, this is a serious problem for
us, as it requires manual intervention to restart the resource.

Could you please help us understand this behavior and how to fix it?


Best Regards,
Samarth J
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Stateful_Test_1
Type: application/octet-stream
Size: 6725 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20190225/55fc764c/attachment-0003.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Stateful_Test_2
Type: application/octet-stream
Size: 6722 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20190225/55fc764c/attachment-0004.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: configure_commands
Type: application/octet-stream
Size: 1647 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20190225/55fc764c/attachment-0005.obj>

