[ClusterLabs] Locate resource with functioning member of clone set?

Ken Gaillot kgaillot at redhat.com
Thu Nov 17 20:04:49 EST 2016

On 11/17/2016 11:37 AM, Israel Brewster wrote:
> I have a resource that is set up as a clone set across my cluster,
> partly for pseudo-load balancing (If someone wants to perform an action
> that will take a lot of resources, I can have them do it on a different
> node than the primary one), but also simply because the resource can
> take several seconds to start, and by having it already running as a
> clone set, I can fail over in the time it takes to move an IP resource -
> essentially zero down time.
> This is all well and good, but I ran into a problem the other day where
> the process on one of the nodes stopped working properly. Pacemaker
> caught the issue, and tried to fix it by restarting the resource, but
> was unable to because the old instance hadn't actually exited completely
> and was still tying up the TCP port, thereby preventing the new instance
> that pacemaker launched from being able to start.
> So this leaves me with two questions: 
> 1) is there a way to set up a "kill script", such that before trying to
> launch a new copy of a process, pacemaker will run this script, which
> would be responsible for making sure that there are no other instances
> of the process running?

Sure, it's called a resource agent :)

When recovering a failed resource, Pacemaker will call the resource
agent's stop action first, then start. The stop should make sure the
service has exited completely. If it doesn't, the agent should be fixed
to do so.
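As an illustration only (not your agent's actual code), a stop action that really guarantees the process is gone usually looks something like this. The daemon name, pidfile path, and timeouts are placeholders:

```shell
#!/bin/sh
# Hypothetical sketch of a resource agent's stop action. PIDFILE and
# DAEMON are assumed names for illustration.

PIDFILE=/var/run/myapp.pid
DAEMON=myapp

myapp_stop() {
    if [ -f "$PIDFILE" ]; then
        pid=$(cat "$PIDFILE")
        kill "$pid" 2>/dev/null
        # Wait up to 10 seconds for a clean exit ...
        for i in $(seq 1 10); do
            kill -0 "$pid" 2>/dev/null || break
            sleep 1
        done
        # ... then escalate to SIGKILL if it is still alive
        kill -0 "$pid" 2>/dev/null && kill -9 "$pid" 2>/dev/null
        rm -f "$PIDFILE"
    fi
    # Catch stray processes the pidfile doesn't track (the situation
    # described above, where an old instance kept the TCP port busy)
    pkill -9 -x "$DAEMON" 2>/dev/null
    return 0   # OCF_SUCCESS: the service is confirmed down
}
```

Only after a stop like this returns success will Pacemaker call start, so the port is guaranteed free.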

> 2) Even in the above situation, where pacemaker couldn't launch a good
> copy of the resource on the one node, the situation could have been
> easily "resolved" by pacemaker moving the virtual IP resource to another
> node where the cloned resource was running correctly, and notifying me
> of the problem. I know how to make colocation constraints in general,
> but how do I do a colocation constraint with a cloned resource where I
> just need the virtual IP running on *any* node where the clone is
> working properly? Or is it the same as any other colocation resource,
> and pacemaker is simply smart enough to both try to restart the failed
> resource and move the virtual IP resource at the same time?

Correct, a simple colocation constraint of "resource R with clone C"
will make sure R runs with a working instance of C.

There is a catch: if *any* instance of C restarts, R will also restart
(even if it stays in the same place), because it depends on the clone as
a whole. Also, in the case you described, pacemaker would first try to
restart both C and R on the same node, rather than move R to another
node (although you could set on-fail=stop on C to force R to move).
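With pcs, such a constraint could look like the sketch below; "virtual-ip" and "myapp-clone" are placeholder names for your IP resource and clone:

```shell
# Keep the IP on a node with a working instance of the clone
pcs constraint colocation add virtual-ip with myapp-clone INFINITY

# Usually you also want the IP to start only after the clone is up
pcs constraint order start myapp-clone then start virtual-ip

# Optionally make a monitor failure stop the local instance (rather
# than restart it in place), so the IP moves to another node
pcs resource update myapp op monitor interval=10s on-fail=stop
```

Note on-fail is a per-operation attribute, which is why it is set on the monitor operation here.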

If that's not sufficient, you could try some magic with node attributes
and rules. The new ocf:pacemaker:attribute resource in 1.1.16 could help.
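A very rough, untested sketch of that idea: run a cloned ocf:pacemaker:attribute resource alongside each clone instance so a node attribute reflects whether the service is healthy there, then locate the IP by rule. All resource names are placeholders:

```shell
# Set node attribute "myapp-up" to 1 while the attribute resource runs
pcs resource create myapp-up ocf:pacemaker:attribute name=myapp-up \
    active_value=1 inactive_value=0 clone

# Tie the attribute resource to the real service
pcs constraint colocation add myapp-up-clone with myapp-clone INFINITY
pcs constraint order start myapp-clone then start myapp-up-clone

# Ban the IP from any node where the attribute is not 1
pcs constraint location virtual-ip rule score=-INFINITY myapp-up ne 1
```

Because the IP's constraint targets the attribute rather than the clone itself, a restart of a clone instance elsewhere would not force the IP to restart.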

> As an addendum to question 2, I'd be interested in any methods there may
> be to be notified of changes in the cluster state, specifically things
> like when a resource fails on a node - my current nagios/icinga setup
> doesn't catch that when pacemaker properly moves the resource to a
> different node, because the resource remains up (which, of course, is
> the whole point), but it would still be good to know something happened
> so I could look into it and see if something needs to be fixed on the failed
> node to allow the resource to run there properly.

Since 1.1.15, Pacemaker has built-in alerts, which run a script of your
choice whenever a node or resource event occurs.

Before 1.1.15, you can use the ocf:pacemaker:ClusterMon resource to do
something similar.
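Configuring either could look roughly like this with pcs; the script path and recipient are placeholders, and Pacemaker ships sample alert agents under /usr/share/pacemaker/alerts/ you can copy and adapt:

```shell
# 1.1.15 and later: a cluster-wide alert plus a recipient passed to it
pcs alert create path=/usr/local/bin/my_alert.sh id=notify-me
pcs alert recipient add notify-me value=admin@example.com

# Before 1.1.15: clone ocf:pacemaker:ClusterMon, which can invoke an
# external program (-E) on each cluster event
pcs resource create cluster-mon ocf:pacemaker:ClusterMon \
    extra_options="-E /usr/local/bin/my_alert.sh" clone
```

Either way, the script receives details of the event (node, resource, operation, result) and can forward them to nagios/icinga, email, or wherever you like.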

> Thanks!
> -----------------------------------------------
> Israel Brewster
> Systems Analyst II
> Ravn Alaska
> 5245 Airport Industrial Rd
> Fairbanks, AK 99709
> (907) 450-7293
> -----------------------------------------------
