[ClusterLabs] [Problem] When a group resource does not stop in a trouble node, the movement of the group resource is started in other nodes.

Wed Oct 5 16:43:53 CEST 2016

Hi All, 

Since Pacemaker 1.1.14, there may be a problem with the stop ordering of group resources.
The problem occurs in a cluster configured without STONITH.

I can reproduce it with the following procedure.

Step 1) Copy the Dummy resource agent to create the Dummy1 and Dummy2 resource agents.
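
Step 1 could be done along the following lines. This is only a sketch: the helper name copy_agent is hypothetical, and the agent directory in the commented usage line is the usual install location, which I am assuming here rather than taking from the environment above.

```shell
#!/bin/sh
# copy_agent DIR SRC NEW...: copy the agent DIR/SRC to DIR/NEW for each NEW name.
# The helper name and argument layout are illustrative only.
copy_agent() {
    dir=$1
    src=$2
    shift 2
    for name in "$@"; do
        # -p keeps the executable bit, which an OCF agent needs in order to run.
        cp -p "$dir/$src" "$dir/$name" || return 1
    done
}

# Assumed agent directory; adjust it for your installation.
# copy_agent /usr/lib/ocf/resource.d/pacemaker Dummy Dummy1 Dummy2
```

The copies are then referenced as ocf:pacemaker:Dummy1 and ocf:pacemaker:Dummy2, as in the crm_mon output of Step 2.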

Step 2) Configure the cluster.

[root@rh72-01 ~]# crm_mon -1 -Af
Stack: corosync
Current DC: rh72-02 (version 1.1.15-e174ec8) - partition with quorum
Last updated: Wed Oct  5 16:24:21 2016
Last change: Wed Oct  5 16:24:15 2016 by root via cibadmin on rh72-01

2 nodes and 2 resources configured

Online: [ rh72-01 rh72-02 ]

 Resource Group: grpDummy
     prmDummy1  (ocf::pacemaker:Dummy1):        Started rh72-01
     prmDummy2  (ocf::pacemaker:Dummy2):        Started rh72-01

Node Attributes:
* Node rh72-01:
* Node rh72-02:

Migration Summary:
* Node rh72-01:
* Node rh72-02:

Step 3) Inject a pseudo-failure into the stop operation of Dummy2.
(snip)
dummy_stop() {
    # Injected failure: stop always returns an error, so the lines below never run.
    return $OCF_ERR_GENERIC
    dummy_monitor
    if [ $? -eq $OCF_SUCCESS ]; then
        rm ${OCF_RESKEY_state}
    fi
    rm -f "${VERIFY_SERIALIZED_FILE}"
    return $OCF_SUCCESS
}
(snip)
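
The injected failure in Step 3 is unconditional. A variant that fails only while a marker file exists may be handier, because the failure can be switched on and off without editing the agent again. The following is a sketch under assumptions: the variable FAIL_STOP_FLAG and its default path are hypothetical, and the OCF return codes are defined inline only so that the snippet runs on its own (a real agent gets them from ocf-shellfuncs).

```shell
#!/bin/sh
# Standard OCF return codes, defined inline so that the sketch is self-contained.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Hypothetical marker file: while it exists, stop reports a failure.
FAIL_STOP_FLAG=${FAIL_STOP_FLAG:-/tmp/Dummy2.fail-stop}

dummy_stop() {
    if [ -e "$FAIL_STOP_FLAG" ]; then
        # Injected failure: pretend the resource could not be stopped.
        return $OCF_ERR_GENERIC
    fi
    # Normal path: remove the state file, if one is configured, and succeed.
    if [ -n "${OCF_RESKEY_state}" ]; then
        rm -f "${OCF_RESKEY_state}"
    fi
    return $OCF_SUCCESS
}
```

Creating the marker file before Step 4 gives the same failed stop as the unconditional version; removing it lets a later stop succeed.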

Step 4) Put the rh72-01 node into standby. The stop of the Dummy2 resource fails, and the resources do not move.

[root@rh72-01 ~]# crm_standby -N rh72-01 -v on
[root@rh72-01 ~]# crm_mon -1 -Af
Stack: corosync
Current DC: rh72-02 (version 1.1.15-e174ec8) - partition with quorum
Last updated: Wed Oct  5 16:27:49 2016
Last change: Wed Oct  5 16:27:47 2016 by root via crm_attribute on rh72-01

2 nodes and 2 resources configured

Node rh72-01: standby
Online: [ rh72-02 ]

 Resource Group: grpDummy
     prmDummy1  (ocf::pacemaker:Dummy1):        Started rh72-01
     prmDummy2  (ocf::pacemaker:Dummy2):        FAILED rh72-01 (blocked)

Node Attributes:
* Node rh72-01:
* Node rh72-02:

Migration Summary:
* Node rh72-01:
   prmDummy2: migration-threshold=1 fail-count=1000000 last-failure='Wed Oct  5 16:29:29 2016'
* Node rh72-02:

Failed Actions:
* prmDummy2_stop_0 on rh72-01 'unknown error' (1): call=15, status=complete, exitreason='none',
    last-rc-change='Wed Oct  5 16:27:47 2016', queued=1ms, exec=34ms

Step 5) Clean up the Dummy2 resource.

[root@rh72-01 ~]# crm_resource -C -r prmDummy2 -H rh72-01 -f
Cleaning up prmDummy2 on rh72-01, removing fail-count-prmDummy2
Waiting for 1 replies from the CRMd. OK

[root@rh72-01 ~]# crm_mon -1 -Af
Stack: corosync
Current DC: rh72-02 (version 1.1.15-e174ec8) - partition with quorum
Last updated: Wed Oct  5 16:30:55 2016
Last change: Wed Oct  5 16:30:53 2016 by hacluster via crmd on rh72-01

2 nodes and 2 resources configured

Node rh72-01: standby
Online: [ rh72-02 ]

 Resource Group: grpDummy
     prmDummy1  (ocf::pacemaker:Dummy1):        Started rh72-02
     prmDummy2  (ocf::pacemaker:Dummy2):        FAILED rh72-01 (blocked)

Node Attributes:
* Node rh72-01:
* Node rh72-02:

Migration Summary:
* Node rh72-01:
   prmDummy2: migration-threshold=1 fail-count=1000000 last-failure='Wed Oct  5 16:32:35 2016'
* Node rh72-02:

Failed Actions:
* prmDummy2_stop_0 on rh72-01 'unknown error' (1): call=23, status=complete, exitreason='none',
    last-rc-change='Wed Oct  5 16:30:54 2016', queued=0ms, exec=35ms

The stop fails again and the Dummy2 resource does not move, but the Dummy1 resource moves to the rh72-02 node.

Unless all the resources in the group have stopped, none of them should move.
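
To check for this inconsistent state automatically, something like the following could be run against the crm_mon output. It is a rough sketch: the function names group_nodes and group_is_split are hypothetical, and the parsing assumes member lines of the form "NAME (class::provider:type): STATE NODE", as in the outputs above.

```shell
#!/bin/sh
# group_nodes GROUP: read `crm_mon -1` text on stdin and print, one per line,
# each distinct node on which a member of GROUP is reported.
group_nodes() {
    awk -v grp="$1" '
        $1 == "Resource" && $2 == "Group:" { in_grp = ($3 == grp); next }
        # Member lines look like: NAME (class::provider:type): STATE NODE ...
        in_grp && $2 ~ /^\(.*\):$/ {
            if (NF >= 4 && !seen[$4]++) print $4
            next
        }
        # Any other line (blank line, next section) ends the group listing.
        { in_grp = 0 }
    '
}

# group_is_split GROUP: exit 0 when members of GROUP are reported on more
# than one node, which is the state seen after Step 5.
group_is_split() {
    n=$(group_nodes "$1" | awk 'END { print NR }')
    [ "$n" -gt 1 ]
}
```

For example, `crm_mon -1 | group_is_split grpDummy && echo "grpDummy is split"` would report the state after Step 5, but not the state after Step 4.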

The problem does not occur in Pacemaker 1.1.13. The probe_complete event was removed in Pacemaker 1.1.14.

I suspect that the problem was introduced around the following changes.
 * https://github.com/ClusterLabs/pacemaker/commit/c1438ae489d791cc689625332b8ced21bfd4d143#diff-8e7ae81c93497126538c2a82fe183692
 * https://github.com/ClusterLabs/pacemaker/commit/8f76b782133857b40a583e947d743d45c7d05dc8#diff-8e7ae81c93497126538c2a82fe183692 

I have registered this problem in Bugzilla.
 * http://bugs.clusterlabs.org/show_bug.cgi?id=5301

Best Regards,
Hideo Yamauch.