[ClusterLabs] Default resource stickiness issue with colocation constraint

Tue Mar 31 00:37:07 EDT 2020

On March 31, 2020 6:01:35 AM GMT+03:00, Ken Gaillot <kgaillot at redhat.com> wrote:
>On Sun, 2020-03-08 at 18:11 +0000, Strahil Nikolov wrote:
>> Hello All,
>> 
>> Can someone help me figure something out?
>> 
>> I have a test cluster with 2 resource groups:
>> 
>> [root@node3 cluster]# pcs status
>> Cluster name: HACLUSTER16
>> Stack: corosync
>> Current DC: node3.localdomain (version 1.1.20-5.el7_7.2-3c4c782f70) - partition with quorum
>> Last updated: Sun Mar  8 20:00:48 2020
>> Last change: Sun Mar  8 20:00:04 2020 by root via cibadmin on node3.localdomain
>> 
>> 3 nodes configured
>> 14 resources configured
>> 
>> Node node2.localdomain: standby
>> Node node3.localdomain: standby
>> Online: [ node1.localdomain ]
>> 
>> Full list of resources:
>> 
>>  RHEVM  (stonith:fence_rhevm):  Started node1.localdomain
>>  MPATH  (stonith:fence_mpath):  Started node1.localdomain
>>  Resource Group: NFS
>>      NFS_LVM    (ocf::heartbeat:LVM):   Started node1.localdomain
>>      NFS_infodir        (ocf::heartbeat:Filesystem):    Started node1.localdomain
>>      NFS_data   (ocf::heartbeat:Filesystem):    Started node1.localdomain
>>      NFS_IP     (ocf::heartbeat:IPaddr2):       Started node1.localdomain
>>      NFS_SRV    (ocf::heartbeat:nfsserver):     Started node1.localdomain
>>      NFS_XPRT1  (ocf::heartbeat:exportfs):      Started node1.localdomain
>>      NFS_NTFY   (ocf::heartbeat:nfsnotify):     Started node1.localdomain
>>  Resource Group: APACHE
>>      APACHE_LVM (ocf::heartbeat:LVM):   Started node1.localdomain
>>      APACHE_cfg (ocf::heartbeat:Filesystem):    Started node1.localdomain
>>      APACHE_data        (ocf::heartbeat:Filesystem):    Started node1.localdomain
>>      APACHE_IP  (ocf::heartbeat:IPaddr2):       Started node1.localdomain
>>      APACHE_SRV (ocf::heartbeat:apache):        Started node1.localdomain
>> 
>> The constraints I have put are:
>> 
>> [root@node3 cluster]# pcs constraint
>> Location Constraints:
>>   Resource: APACHE
>>     Enabled on: node1.localdomain (score:3000)
>>     Enabled on: node2.localdomain (score:2000)
>>     Enabled on: node3.localdomain (score:1000)
>>   Resource: NFS
>>     Enabled on: node1.localdomain (score:1000)
>>     Enabled on: node2.localdomain (score:2000)
>>     Enabled on: node3.localdomain (score:3000)
>> Ordering Constraints:
>> Colocation Constraints:
>>   APACHE with NFS (score:-1000)
>> Ticket Constraints:
>> 
>> [root@node3 cluster]# pcs resource defaults
>> resource-stickiness=1000
>> 
>> As you can see, the default stickiness is 1000 per resource, or 5000
>> for the five-member APACHE group.
>> The colocation rule score is just -1000 and as per my understanding
>> it should be ignored when the 2 nodes are removed from standby.
>> 
>> Can someone clarify why the APACHE group is moved when the resource
>> stickiness score is higher than the colocation score?
>> 
>> I have attached a file with the crm_simulate output (the output is
>> correct: when the standby is removed, the group is moved).
>> 
>> Best Regards,
>> Strahil Nikolov
>
>Coincidentally I just fixed a bug last week that I believe is the
>culprit here. I expect if you test the current master branch it won't
>happen. The fix will be in 2.0.4 (the first release candidate is
>expected in a couple of weeks).
>
>The problem was in the code that incorporates colocation dependencies'
>node preferences. If a group was colocated with some resource, the
>resource would incorporate the scores from each member of the group in
>turn. However each member of the group would also incorporate its own
>dependencies' scores in its score -- which includes the internal group
>colocation of all members after it. So, the members of the colocated
>group were being counted multiple times, and therefore having a bigger
>impact than the configured colocation score. The fix was just to
>incorporate scores from the first group member since it would
>incorporate all the rest.

Hey Ken,

Thanks for the detailed explanation, and good job!
So the bug is fixed in the latest upstream version. What about RHEL: should I open a Bugzilla?

Best Regards,
Strahil Nikolov
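
For readers of the archive, the double counting Ken describes above can be sketched with a toy model. This is not Pacemaker's code: the function names, and the simplification that each group member contributes a flat 1000 (the default stickiness from the post), are assumptions made only for illustration. It shows why summing the score of every member, when each member already includes all the members after it, inflates the group's weight.

```python
# Toy model only: NOT Pacemaker's allocation code. All names are hypothetical.

def cumulative_score(own_scores, start):
    """Score of member `start` plus every member after it in the group.

    Models a group member incorporating the internal group colocation
    of all members that follow it.
    """
    return sum(own_scores[start:])


def buggy_group_contribution(own_scores):
    """Old behaviour as described: incorporate each member in turn,
    although each member already includes all later members."""
    return sum(cumulative_score(own_scores, i) for i in range(len(own_scores)))


def fixed_group_contribution(own_scores):
    """Fixed behaviour as described: incorporate only the first member,
    since it already includes all the rest."""
    return cumulative_score(own_scores, 0) if own_scores else 0


# Assumption: five members, each with a preference of 1000 for the current node.
members = [1000] * 5
print("buggy:", buggy_group_contribution(members))
print("fixed:", fixed_group_contribution(members))
```

In this model the last member of an n-member group is counted n times, so the group's influence grows roughly quadratically with its size instead of linearly.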