[ClusterLabs] Default resource stickiness issue with colocation constraint

Mon Mar 30 23:01:35 EDT 2020

On Sun, 2020-03-08 at 18:11 +0000, Strahil Nikolov wrote:
> Hello All,
> 
> Can someone help me figure something out?
> 
> I have a test cluster with 2 resource groups:
> 
> [root@node3 cluster]# pcs status
> Cluster name: HACLUSTER16
> Stack: corosync
> Current DC: node3.localdomain (version 1.1.20-5.el7_7.2-3c4c782f70) - partition with quorum
> Last updated: Sun Mar  8 20:00:48 2020
> Last change: Sun Mar  8 20:00:04 2020 by root via cibadmin on node3.localdomain
> 
> 3 nodes configured
> 14 resources configured
> 
> Node node2.localdomain: standby
> Node node3.localdomain: standby
> Online: [ node1.localdomain ]
> 
> Full list of resources:
> 
>  RHEVM  (stonith:fence_rhevm):  Started node1.localdomain
>  MPATH  (stonith:fence_mpath):  Started node1.localdomain
>  Resource Group: NFS
>      NFS_LVM      (ocf::heartbeat:LVM):         Started node1.localdomain
>      NFS_infodir  (ocf::heartbeat:Filesystem):  Started node1.localdomain
>      NFS_data     (ocf::heartbeat:Filesystem):  Started node1.localdomain
>      NFS_IP       (ocf::heartbeat:IPaddr2):     Started node1.localdomain
>      NFS_SRV      (ocf::heartbeat:nfsserver):   Started node1.localdomain
>      NFS_XPRT1    (ocf::heartbeat:exportfs):    Started node1.localdomain
>      NFS_NTFY     (ocf::heartbeat:nfsnotify):   Started node1.localdomain
>  Resource Group: APACHE
>      APACHE_LVM   (ocf::heartbeat:LVM):         Started node1.localdomain
>      APACHE_cfg   (ocf::heartbeat:Filesystem):  Started node1.localdomain
>      APACHE_data  (ocf::heartbeat:Filesystem):  Started node1.localdomain
>      APACHE_IP    (ocf::heartbeat:IPaddr2):     Started node1.localdomain
>      APACHE_SRV   (ocf::heartbeat:apache):      Started node1.localdomain
> 
> The constraints I have put are:
> 
> [root@node3 cluster]# pcs constraint
> Location Constraints:
>   Resource: APACHE
>     Enabled on: node1.localdomain (score:3000)
>     Enabled on: node2.localdomain (score:2000)
>     Enabled on: node3.localdomain (score:1000)
>   Resource: NFS
>     Enabled on: node1.localdomain (score:1000)
>     Enabled on: node2.localdomain (score:2000)
>     Enabled on: node3.localdomain (score:3000)
> Ordering Constraints:
> Colocation Constraints:
>   APACHE with NFS (score:-1000)
> Ticket Constraints:
> 
> [root@node3 cluster]# pcs resource defaults
> resource-stickiness=1000
> 
> As you can see, the default stickiness is 1000 per resource, or 5000
> for the five-member APACHE group (7000 for the seven-member NFS group).
> The colocation rule score is just -1000, and as per my understanding
> it should be outweighed by the stickiness when the 2 nodes are removed
> from standby.
> 
> Can someone clarify why the APACHE group is moved when the resource
> stickiness score is higher than the colocation score?
> 
> I have attached a file with the crm_simulate output (the output
> matches what actually happens: when the standby is removed, the group
> is moved).
> 
> Best Regards,
> Strahil Nikolov

Coincidentally, I just fixed a bug last week that I believe is the
culprit here. I expect that if you test the current master branch, it
won't happen. The fix will be in 2.0.4 (the first release candidate is
expected in a couple of weeks).

The problem was in the code that incorporates colocation dependencies'
node preferences. If a group was colocated with some resource, the
resource would incorporate the scores from each member of the group in
turn. However, each member of the group would also incorporate its own
dependencies' scores in its score -- which includes the internal group
colocation of all members after it. So the members of the colocated
group were being counted multiple times, and therefore had a bigger
impact than the configured colocation score. The fix was simply to
incorporate scores from the first group member only, since it already
incorporates all the rest.
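
The double counting described above can be illustrated with a toy model.
This is only a sketch, not Pacemaker's actual code: the function names are
hypothetical, each member is assumed to carry a single numeric preference,
and the scaling by the colocation score is left out. It shows only why
summing every member's already-aggregated score inflates the group's
weight compared with taking the first member alone.

```python
# Toy model only: NOT Pacemaker code. All names and numbers are illustrative.

def member_score_with_dependents(own_scores, index):
    """Score of one group member after it has folded in every member
    after it (the implicit colocation chain inside a group)."""
    return sum(own_scores[index:])

def group_contribution_buggy(own_scores):
    """Old behaviour as described: add each member's aggregated score in
    turn, so later members are counted more than once."""
    return sum(member_score_with_dependents(own_scores, i)
               for i in range(len(own_scores)))

def group_contribution_fixed(own_scores):
    """Fixed behaviour as described: use the first member only, since it
    already includes all the rest."""
    return member_score_with_dependents(own_scores, 0) if own_scores else 0

# Assumption: five members (as in the APACHE group), 1000 each.
apache_members = [1000] * 5
buggy = group_contribution_buggy(apache_members)
fixed = group_contribution_fixed(apache_members)
print(buggy, fixed)  # 15000 5000
```

In this simplified model the j-th member ends up counted j times, so a
five-member group weighs three times as much as intended.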
-- 
Ken Gaillot <kgaillot at redhat.com>