[Pacemaker] Upgrading to Pacemaker 1.1.7. Issue: sticky resources failing back after reboot

Jake Smith jsmith at argotec.com
Mon Sep 10 10:15:01 EDT 2012


----- Original Message -----
> From: "Parshvi" <parshvi.17 at gmail.com>
> To: pacemaker at clusterlabs.org
> Sent: Monday, September 10, 2012 4:06:51 AM
> Subject: Re: [Pacemaker] Upgrading to Pacemaker 1.1.7. Issue: sticky resources failing back after reboot
> 
> David Vossel <dvossel at ...> writes:
> > > Hi,
> > > We have upgraded pacemaker version 1.0.12 to 1.1.7
> > > The upgrade was done since resources failed to recover after a
> > > timeout (monitor|stop[unmanaged]) and logs observed are:
> > > 
> > > WARN: print_graph: Synapse 6 is pending (priority: 0)
> > > Sep 03 16:55:18 CSS-FU-2 crmd: [25200]: WARN: print_elem: [Action 103]: Pending (id: SnmpAgent_monitor_5000, loc: CSS-FU-2, priority: 0)
> > > Sep 03 16:55:18 CSS-FU-2 crmd: [25200]: WARN: print_elem: * [Input 102]: Pending (id: SnmpAgent_start_0, loc: CSS-FU-2, priority: 0)
> > > Sep 03 16:55:18 CSS-FU-2 crmd: [25200]: WARN: print_graph: Synapse 7 is pending (priority: 0)
> > > 
> > > Reading through the forum mails, it was inferred that this issue is
> > > fixed in 1.1.7.
> > > 
> > > Platform OS: OEL 5.8
> > > Pacemaker Version: 1.1.7
> > > Corosync version: 1.4.3
> > > 
> > > Pacemaker and all its dependent packages were built from source
> > > (tarball: github).
> > > glib version used for build: 2.32.2
> > > 
> > > The following issue is observed in Pacemaker 1.1.7:
> > > 1) There is a two-node cluster.
> > > 2) When the primary node is rebooted, or Pacemaker is restarted, the
> > >    resources fail over to the secondary node.
> > > 3) There are 4 groups of services:
> > >    2 groups are not sticky,
> > >    1 group is a master/slave multi-state resource,
> > >    1 group is STICKY.
> > > 4) When the primary node comes back online, even the sticky
> > >    resources fail back to the primary node (the issue).
> > > 5) Now, if the secondary node is rebooted, the resources fail over
> > >    to the primary node.
> > > 6) Once the secondary node is up, only the non-sticky resources fail
> > >    back. Sticky resources remain on the primary node.
> > > 
> > > 7) Even if a location preference for the sticky resources is set for
> > >    Node-2 (the secondary node), the sticky resources still fail back
> > >    to Node-1.
> > > 
> > > We're using Pacemaker 1.0.12 in production. We're facing issues of
> > > IPaddr and other resources' monitor operations timing out and
> > > Pacemaker not recovering from it (shared above).
> > > 
> > > Any help is welcome.
> > > 
> > > PS: Please mention if any logs or configuration need to be shared.
> > 
> > My guess is that this is an issue with node scores for the resources
> > in question.  Stickiness and location constraints work in a similar
> > way.  You could really think of resource stickiness as a temporary
> > location constraint on a resource that changes depending on what node
> > it is on.
> > 
> > If you have a resource with stickiness enabled and you want the
> > resource to stay put, the stickiness score has to outweigh all the
> > location constraints for that resource on other nodes.  If you are
> > using colocation constraints, this becomes increasingly complicated,
> > as a resource's per-node location score could change based on the
> > location of another resource.
> > 
> > For specific advice on your scenario, there is little we can offer
> > without seeing your exact configuration.
> > 
> Hi David,
> Thanks for the quick response.
> 
> I have shared the configuration at the following path:
> https://dl.dropbox.com/u/20096935/cib.txt
> 
> The issue has been observed for the following group:
> 1) Rsc_Ms1
> 2) Rsc_S
> 3) Rsc_T
> 4) Rsc_TGroupClusterIP
> 
> Colocation: resources 1), 2) and 3) have been colocated with
> resource 4).
> Location preference: resource 4) prefers one of the nodes in the
> cluster.
> Ordering: resources 1), 2) and 3) are started (no sequential ordering
> between these resources) when resource 4) is started.
> 

I'm not an expert when it comes to scoring, but if you want a resource to prefer to stay on its current node instead of failing back to the preferred node, I would definitely set the resource-stickiness value much higher to ensure that behavior. Test with 20, 50 or 200 as the stickiness value and at least the fail-back problem should go away.
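
For example, with the crm shell you could either raise the cluster-wide default or set it just on the sticky group. A rough sketch - "Rsc_TGroup" is only a placeholder here, use whatever the sticky group is actually named in your cib.txt:

    # raise the default stickiness for all resources
    crm configure rsc_defaults resource-stickiness=200

    # or set it only on the sticky group (placeholder name)
    crm resource meta Rsc_TGroup set resource-stickiness 200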

I believe the issue is that you have multiple colocated resources that each have a location preference for that node with a score of 1.  The node's score then becomes the sum of the colocated resources' location scores, and that sum is higher than your stickiness of 1, which causes the movement.
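
To make that concrete, the pattern I suspect looks roughly like this (constraint names, node names and scores are made up for illustration, not copied from your cib.txt):

    location loc-ip Rsc_TGroupClusterIP 1: node-1
    location loc-s  Rsc_S               1: node-1
    location loc-t  Rsc_T               1: node-1
    colocation col-s inf: Rsc_S Rsc_TGroupClusterIP
    colocation col-t inf: Rsc_T Rsc_TGroupClusterIP

At least to a first approximation, with resource-stickiness=1 node-1 is worth 1 + 1 + 1 = 3 to the colocated set when it comes back, which outweighs the stickiness of 1 for staying put, so everything fails back. Raising the stickiness above the summed location scores (or dropping the extra location preferences) should keep the set where it is. Running crm_simulate -sL against the live cluster will show you the scores the policy engine actually computes.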

HTH

Jake




