[ClusterLabs] Locate resource with functioning member of clone set?

Thu Nov 17 17:37:40 UTC 2016

I have a resource that is set up as a clone set across my cluster. This is partly for pseudo-load balancing (if someone wants to perform an action that will take a lot of resources, I can have them do it on a node other than the primary one), but also simply because the resource can take several seconds to start. With it already running as a clone set, I can fail over in the time it takes to move an IP resource, which means essentially zero downtime.

This is all well and good, but I ran into a problem the other day where the process on one of the nodes stopped working properly. Pacemaker caught the issue and tried to fix it by restarting the resource, but was unable to: the old instance hadn't actually exited completely and was still tying up the TCP port, which prevented the new instance that Pacemaker launched from starting.
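The failure mode described above (a stale instance still holding the TCP port) could be detected with a check along these lines. This is only a minimal Python sketch: the function name `port_in_use` and the port number 8080 are hypothetical placeholders, not part of my actual setup and not something Pacemaker provides.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper for illustration only. A hung process that still
    has the listening socket open will normally still show up here,
    because the kernel completes the connection on its behalf.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return s.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 8080 is an assumed example port, not the real service port.
    if port_in_use(8080):
        print("port 8080 is still held by some process")
    else:
        print("port 8080 is free")
```

A check like this only reports that the port is taken; it does not identify or stop the process that holds it.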

So this leaves me with two questions: 

1) Is there a way to set up a "kill script", such that before trying to launch a new copy of a process, Pacemaker will run this script, which would be responsible for making sure that no other instances of the process are running?
2) Even in the above situation, where Pacemaker couldn't launch a good copy of the resource on the one node, the situation could have been easily "resolved" by Pacemaker moving the virtual IP resource to another node where the cloned resource was running correctly, and notifying me of the problem. I know how to make colocation constraints in general, but how do I write a colocation constraint with a cloned resource where I just need the virtual IP running on *any* node where the clone is working properly? Or is it the same as any other colocation constraint, and Pacemaker is simply smart enough to both try to restart the failed resource and move the virtual IP resource at the same time?

As an addendum to question 2, I'd be interested in any methods there may be for being notified of changes in the cluster state, specifically things like a resource failing on a node. My current Nagios/Icinga setup doesn't catch that when Pacemaker properly moves the resource to a different node, because the resource remains up (which, of course, is the whole point). It would still be good to know that something happened, so I could look into it and see whether something needs to be fixed on the failed node to allow the resource to run there properly.

Thanks!
-----------------------------------------------
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
-----------------------------------------------
