[ClusterLabs] Default resource stickiness issue with colocation constraint

Tue Mar 31 10:38:53 EDT 2020

On Tue, 2020-03-31 at 07:37 +0300, Strahil Nikolov wrote:
> On March 31, 2020 6:01:35 AM GMT+03:00, Ken Gaillot <
> kgaillot at redhat.com> wrote:
> > On Sun, 2020-03-08 at 18:11 +0000, Strahil Nikolov wrote:
> > > Hello All,
> > > 
> > > can someone help me figure something out.
> > > 
> > > I have a test cluster with 2 resource groups:
> > > 
> > > [root at node3 cluster]# pcs status
> > > Cluster name: HACLUSTER16
> > > Stack: corosync
> > > Current DC: node3.localdomain (version 1.1.20-5.el7_7.2-3c4c782f70) - partition with quorum
> > > Last updated: Sun Mar  8 20:00:48 2020
> > > Last change: Sun Mar  8 20:00:04 2020 by root via cibadmin on node3.localdomain
> > > 
> > > 3 nodes configured
> > > 14 resources configured
> > > 
> > > Node node2.localdomain: standby
> > > Node node3.localdomain: standby
> > > Online: [ node1.localdomain ]
> > > 
> > > Full list of resources:
> > > 
> > >  RHEVM  (stonith:fence_rhevm):  Started node1.localdomain
> > >  MPATH  (stonith:fence_mpath):  Started node1.localdomain
> > >  Resource Group: NFS
> > >      NFS_LVM      (ocf::heartbeat:LVM):         Started node1.localdomain
> > >      NFS_infodir  (ocf::heartbeat:Filesystem):  Started node1.localdomain
> > >      NFS_data     (ocf::heartbeat:Filesystem):  Started node1.localdomain
> > >      NFS_IP       (ocf::heartbeat:IPaddr2):     Started node1.localdomain
> > >      NFS_SRV      (ocf::heartbeat:nfsserver):   Started node1.localdomain
> > >      NFS_XPRT1    (ocf::heartbeat:exportfs):    Started node1.localdomain
> > >      NFS_NTFY     (ocf::heartbeat:nfsnotify):   Started node1.localdomain
> > >  Resource Group: APACHE
> > >      APACHE_LVM   (ocf::heartbeat:LVM):         Started node1.localdomain
> > >      APACHE_cfg   (ocf::heartbeat:Filesystem):  Started node1.localdomain
> > >      APACHE_data  (ocf::heartbeat:Filesystem):  Started node1.localdomain
> > >      APACHE_IP    (ocf::heartbeat:IPaddr2):     Started node1.localdomain
> > >      APACHE_SRV   (ocf::heartbeat:apache):      Started node1.localdomain
> > > 
> > > The constraints I have put are:
> > > 
> > > [root at node3 cluster]# pcs constraint
> > > Location Constraints:
> > >   Resource: APACHE
> > >     Enabled on: node1.localdomain (score:3000)
> > >     Enabled on: node2.localdomain (score:2000)
> > >     Enabled on: node3.localdomain (score:1000)
> > >   Resource: NFS
> > >     Enabled on: node1.localdomain (score:1000)
> > >     Enabled on: node2.localdomain (score:2000)
> > >     Enabled on: node3.localdomain (score:3000)
> > > Ordering Constraints:
> > > Colocation Constraints:
> > >   APACHE with NFS (score:-1000)
> > > Ticket Constraints:
> > > 
> > > [root at node3 cluster]# pcs resource defaults
> > > resource-stickiness=1000
> > > 
> > > As you can see, the default stickiness is 1000 per resource, or
> > > 5000 for the five-member APACHE group.
> > > The colocation rule score is just -1000 and as per my
> > > understanding
> > > it should be ignored when the 2 nodes are removed from standby.
> > > 
> > > Can someone clarify why the APACHE group is moved when the
> > > resource stickiness score is higher than the colocation score?
> > > 
> > > I have attached a file with the crm_simulate output (the output
> > > matches what actually happens: when the standby is removed, the
> > > group is moved).
> > > 
> > > Best Regards,
> > > Strahil Nikolov
> > 
> > Coincidentally I just fixed a bug last week that I believe is the
> > culprit here. I expect if you test the current master branch it
> > won't happen. The fix will be in 2.0.4 (the first release candidate
> > is expected in a couple of weeks).
> > 
> > The problem was in the code that incorporates colocation
> > dependencies' node preferences. If a group was colocated with some
> > resource, the resource would incorporate the scores from each member
> > of the group in turn. However, each member of the group would also
> > incorporate its own dependencies' scores in its score -- which
> > includes the internal group colocation of all members after it. So
> > the members of the colocated group were being counted multiple
> > times, and therefore had a bigger impact than the configured
> > colocation score. The fix was to incorporate scores only from the
> > first group member, since it already incorporates all the rest.
> 
> Hey Ken,
> 
> Thanks for the detailed explanation, and good job!
> So the bug is fixed in the latest upstream version. What about RHEL -
> should I open a bugzilla?
> 
> Best Regards,
> Strahil Nikolov

The fix is expected to land in RHEL 7.9 and 8.3.
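
For anyone reading this in the archive, the double counting described in
the quoted explanation above can be pictured with a toy Python sketch.
This is only an illustration under simplifying assumptions, not
Pacemaker's actual code: the function names are hypothetical, each
member is given a made-up preference of 1000, and the scaling that
Pacemaker applies for the colocation score is ignored. The five-entry
list mirrors the five members of the APACHE group shown above.

```python
def chain_contribution(member_scores, start):
    """Score contributed by the member at index `start`, assuming it has
    already absorbed the scores of every group member after it."""
    return sum(member_scores[start:])


def buggy_total(member_scores):
    # Old behaviour (as described above): visit every member in turn,
    # even though each one already includes all members after it.
    return sum(
        chain_contribution(member_scores, i)
        for i in range(len(member_scores))
    )


def fixed_total(member_scores):
    # Fixed behaviour: visit only the first member, which already
    # includes all the rest.
    return chain_contribution(member_scores, 0)


# Hypothetical per-member preferences, one per APACHE group member.
apache_members = [1000] * 5

print("buggy:", buggy_total(apache_members))  # 5000+4000+3000+2000+1000
print("fixed:", fixed_total(apache_members))  # each member counted once
```

In this toy model the over-count grows quadratically with the number of
group members, which is one way to see how a small configured
colocation score could end up outweighing stickiness for a large group.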
-- 
Ken Gaillot <kgaillot at redhat.com>