[ClusterLabs] [Problem] The fail-over is completed without the stop of the resource being carried out.

Mon Sep 26 13:23:10 UTC 2016

Hi All,

We found a problem in a cluster that uses neither quorum control nor STONITH.

The problem can be reproduced with the following procedure.

Step1) Build the cluster.

[root@rh72-01 ~]# crm configure load update trac3437.crm

[root@rh72-01 ~]# crm_mon -1 -Af
Stack: corosync
Current DC: rh72-01 (version 1.1.15-e174ec8) - partition with quorum
Last updated: Mon Sep 26 13:00:22 2016          Last change: Mon Sep 26 12:59:52 2016 by root via cibadmin on rh72-01

2 nodes and 1 resource configured

Online: [ rh72-01 rh72-02 ]

Resource Group: grpDummy
prmDummy   (ocf::pacemaker:Dummy): Started rh72-01

Node Attributes:
* Node rh72-01:
* Node rh72-02:

Migration Summary:
* Node rh72-01:
* Node rh72-02:

Step2) Edit the Dummy resource agent so that the stop operation fails.

(snip)
dummy_stop() {
    # Injected failure: stop always returns an error, so the original body below is never reached.
    return $OCF_ERR_GENERIC
    dummy_monitor
    if [ $? -eq $OCF_SUCCESS ]; then
        rm ${OCF_RESKEY_state}
    fi
    rm -f "${VERIFY_SERIALIZED_FILE}"
    return $OCF_SUCCESS
}
(snip)
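
The failure injected in Step2 could also be made switchable, so that the agent does not have to be edited twice. The following is only a minimal sketch under assumptions: FAIL_STOP_FLAG, STATE_FILE and their default paths are hypothetical names for illustration and are not part of the real Dummy agent, and the stop logic is reduced to the state-file handling.

```shell
#!/bin/sh
# Sketch only: simplified stand-in for the stop action of the Dummy agent.
# FAIL_STOP_FLAG, STATE_FILE and the default paths are hypothetical.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
: "${STATE_FILE:=/tmp/Dummy-sketch.state}"
: "${FAIL_STOP_FLAG:=/tmp/Dummy-sketch.fail-stop}"

dummy_stop() {
    # Injected failure: report a stop error and leave the state file in place.
    if [ -e "$FAIL_STOP_FLAG" ]; then
        return $OCF_ERR_GENERIC
    fi
    # Normal path: removing the state file marks the resource as stopped.
    rm -f "$STATE_FILE"
    return $OCF_SUCCESS
}
```

With such a switch, creating the flag file would correspond to Step2 and removing it to Step4.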

Step3) Stop Pacemaker on the node. The stop operation of the resource fails.
[root@rh72-01 ~]# systemctl stop pacemaker

[root@rh72-01 ~]# crm_mon -1 -Af
Stack: corosync
Current DC: rh72-01 (version 1.1.15-e174ec8) - partition with quorum
Last updated: Mon Sep 26 13:01:33 2016          Last change: Mon Sep 26 12:59:52 2016 by root via cibadmin on rh72-01

2 nodes and 1 resource configured

Online: [ rh72-01 rh72-02 ]

Resource Group: grpDummy
prmDummy   (ocf::pacemaker:Dummy): FAILED rh72-01 (blocked)

Node Attributes:
* Node rh72-01:
* Node rh72-02:

Migration Summary:
* Node rh72-01:
prmDummy: migration-threshold=1 fail-count=1000000 last-failure='Mon Sep 26 13:01:18 2016'
* Node rh72-02:

Failed Actions:
* prmDummy_stop_0 on rh72-01 'unknown error' (1): call=8, status=complete, exitreason='none',
last-rc-change='Mon Sep 26 13:01:18 2016', queued=0ms, exec=33ms

Step4) Restore the Dummy resource agent to the original.
(snip)
dummy_stop() {
    dummy_monitor
    if [ $? -eq $OCF_SUCCESS ]; then
        rm ${OCF_RESKEY_state}
    fi
    rm -f "${VERIFY_SERIALIZED_FILE}"
    return $OCF_SUCCESS
}
(snip)

Step5) Clean up the failure of the Dummy resource.

[root@rh72-01 ~]# crm_resource -C -r prmDummy -H rh72-01 -f
Cleaning up prmDummy on rh72-01, removing fail-count-prmDummy
Waiting for 1 replies from the CRMd. OK

Step6) Fail-over completes. However, the Dummy resource is never stopped on node rh72-01.

[root@rh72-02 ~]# crm_mon -1 -Af
Stack: corosync
Current DC: rh72-02 (version 1.1.15-e174ec8) - partition WITHOUT quorum
Last updated: Mon Sep 26 13:02:32 2016          Last change: Mon Sep 26 13:02:20 2016 by hacluster via crmd on rh72-01

2 nodes and 1 resource configured

Online: [ rh72-02 ]
OFFLINE: [ rh72-01 ]

Resource Group: grpDummy
prmDummy   (ocf::pacemaker:Dummy): Started rh72-02

Node Attributes:
* Node rh72-02:

Migration Summary:
* Node rh72-02:

[root@rh72-01 ~]# ls -lt /var/run/Dummy-prmDummy.state
-rw-r-----. 1 root root 0 Sep 26  2016 /var/run/Dummy-prmDummy.state
-------------
Sep 26 13:02:21 rh72-01 crmd[1584]: warning: Action 2 (prmDummy_monitor_0) on rh72-01 failed (target: 7 vs. rc: 0): Error
Sep 26 13:02:21 rh72-01 crmd[1584]: notice: Transition aborted by operation prmDummy_monitor_0 'create' on rh72-01: Event failed | magic=0:0;2:6:7:196faae4-4faf-42a5-9ffb-9dcf6272e3fb cib=0.6.2 source=match_graph_event:310 complete=false
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Action prmDummy_monitor_0 (2) confirmed on rh72-01 (rc=0)
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Detected action (6.2) prmDummy_monitor_0.13=ok: failed
Sep 26 13:02:21 rh72-01 crmd[1584]: warning: Action 2 (prmDummy_monitor_0) on rh72-01 failed (target: 7 vs. rc: 0): Error
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Transition aborted by operation prmDummy_monitor_0 'create' on rh72-01: Event failed | magic=0:0;2:6:7:196faae4-4faf-42a5-9ffb-9dcf6272e3fb cib=0.6.2 source=match_graph_event:310 complete=false
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Action prmDummy_monitor_0 (2) confirmed on rh72-01 (rc=0)
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Detected action (6.2) prmDummy_monitor_0.13=ok: failed
Sep 26 13:02:21 rh72-01 crmd[1584]: notice: Transition 6 (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-input-6.bz2): Complete
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Input I_STOP received in state S_TRANSITION_ENGINE from notify_crmd
Sep 26 13:02:21 rh72-01 crmd[1584]: info: State transition S_TRANSITION_ENGINE -> S_STOPPING | input=I_STOP cause=C_FSA_INTERNAL origin=notify_crmd
Sep 26 13:02:21 rh72-01 crmd[1584]: info: DC role released
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Connection to the Policy Engine released
Sep 26 13:02:21 rh72-01 cib[1579]: info: Forwarding cib_modify operation for section status to all (origin=local/crmd/56)
Sep 26 13:02:21 rh72-01 cib[1579]: info: Diff: --- 0.6.2 2
Sep 26 13:02:21 rh72-01 cib[1579]: info: Diff: +++ 0.6.3 (null)
Sep 26 13:02:21 rh72-01 cib[1579]: info: +  /cib:  @num_updates=3
Sep 26 13:02:21 rh72-01 cib[1579]: info: +  /cib/status/node_state[@id='1']:  @crm-debug-origin=do_dc_release, @expected=down
Sep 26 13:02:21 rh72-01 cib[1579]: info: Completed cib_modify operation for section status: OK (rc=0, origin=rh72-01/crmd/56, version=0.6.3)
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Transitioner is now inactive
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Disconnecting STONITH...
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Fencing daemon disconnected
-------------
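
In the log above, the probe (prmDummy_monitor_0) on rh72-01 returns rc 0 where 7 was expected, which is consistent with the state file still being present in the ls output. A minimal sketch of that check, assuming the monitor of the Dummy agent reduces to a state-file test (the real agent does more); probe_state_file is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch only: state-file test in the style of the Dummy agent's monitor.
# probe_state_file is a hypothetical helper, not a Pacemaker command.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

# Returns 0 while the state file exists (resource considered running), 7 otherwise.
probe_state_file() {
    if [ -f "$1" ]; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}
```

Calling probe_state_file /var/run/Dummy-prmDummy.state on rh72-01 and inspecting $? would show whether the resource is still considered running there.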

Completing the fail-over without the resource being stopped on the original node is a problem.

When we put node rh72-01 into standby instead, the problem does not occur.
The problem seems to lie in how Pacemaker handles its own shutdown.

I have filed this report in Bugzilla.
- http://bugs.clusterlabs.org/show_bug.cgi?id=5300

I attached a crm_report to the Bugzilla entry.

Best Regards,
Hideo Yamauchi.