[ClusterLabs] Information regarding pacemaker integration with user application

Tue Mar 17 11:20:19 EDT 2015

Hi Ken

Thanks a lot for clarifying my doubts.
Can you share any sample C-based resource agent code that we could use as a reference for our application?
I have seen the LSB/OCF-based scripts, but in our application there are various scenarios where we need to monitor faults other than a killed process: hardware-related faults, or the process being unable to do its work because of CPU overload, a memory threshold being crossed, and so on.
In such cases we could write specific monitoring functions in the resource agent that communicate the fault error codes to the cluster manager so that it can take appropriate action.

Thanks
ramya 

-----Original Message-----
From: Ken Gaillot [mailto:kgaillot at redhat.com] 
Sent: 17 March 2015 19:08
To: users at clusterlabs.org
Subject: Re: [ClusterLabs] Information regarding pacemaker integration with user application

On 03/16/2015 12:56 AM, Ramya Ramadurai wrote:
> We are evaluating the pacemaker HA stack on Linux for integration with a
> user application being developed for a telecom system. At a high level,
> the requirements of this application are:
> 1) The application is to run on an ATCA 46XX chassis, a 6-slot chassis
> consisting of 2 SCMs (shelf management cards) and 4 CPM-9 blades. The
> application will run on the CPM-9 blades.
> 2) To support HA for this application, we plan to make one of the CPM-9
> cards a standby card that takes over once the active card becomes
> faulty. The active unit can become faulty either when the application
> goes down or because of ATCA hardware or other ATCA-related faults.
> 3) After switchover to the standby unit, the application should function
> as if there were no downtime (probably on the order of milliseconds),
> and it should retain all real-time data.
> 
> We have gone through the "Clusters from Scratch" document and basic
> information on how pacemaker works, but we did not find any example
> where pacemaker exposes an API for a user application to integrate
> with. From the doxygen documentation, too, we are unable to work out
> how the functions are called, what their purpose is, etc.
> Can you please help us with some in-depth details on pacemaker
> internals? It would also be a great help if you could suggest how to
> integrate the above application with pacemaker, and which components
> need to be built around pacemaker to interact with ATCA.

Hi Ramya,

I don't think you need the pacemaker API; configuration is likely all you need. At most, you may need to write custom resource agent(s) and/or fence agent(s) if none that meet your needs are already available.

Your active and standby CPM-9s would be cluster nodes in the pacemaker configuration. You might want to consider adding a third card to the cluster to support quorum (the quorum node doesn't have to be allowed to run the application, and it doesn't have to be exclusively dedicated to the cluster -- it's just a tiebreaker vote if communication fails between the other nodes).

Resource agents and fence agents are simple programs (usually shell scripts) that take commands from pacemaker via the command line and return status via the exit code.

So whatever your user application is, it would have a resource agent. The agent accepts commands such as start and stop, and returns particular exit codes for success, failure, etc.
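
Since you asked about C: the following is only a rough sketch of that command-to-exit-code idea, not code taken from any existing agent. The exit code values are the ones defined by the OCF resource agent specification; everything else (the `app_health` struct, its fields, and the function names) is hypothetical and stands in for whatever checks your application actually needs. A complete OCF agent must also implement at least the meta-data action, which is omitted here.

```c
#include <assert.h>
#include <string.h>

/* Exit codes defined by the OCF resource agent specification. */
enum {
    OCF_SUCCESS           = 0,
    OCF_ERR_GENERIC       = 1,
    OCF_ERR_UNIMPLEMENTED = 3,
    OCF_NOT_RUNNING       = 7
};

/* Hypothetical health snapshot; a real agent would query the application. */
typedef struct {
    int running;     /* application process is up */
    int hw_fault;    /* a hardware fault has been reported */
    int overloaded;  /* CPU or memory threshold exceeded */
} app_health;

/* Map the health snapshot to the exit code a monitor action would return. */
int monitor_status(const app_health *h)
{
    assert(h != NULL);
    if (!h->running)
        return OCF_NOT_RUNNING;      /* cleanly stopped */
    if (h->hw_fault || h->overloaded)
        return OCF_ERR_GENERIC;      /* running, but failed */
    return OCF_SUCCESS;
}

/* Dispatch the action name the cluster passes as the first argument. */
int agent_dispatch(const char *action, app_health *h)
{
    assert(action != NULL && h != NULL);
    if (strcmp(action, "start") == 0) {
        h->running = 1;              /* placeholder for really starting it */
        return OCF_SUCCESS;
    }
    if (strcmp(action, "stop") == 0) {
        h->running = 0;              /* placeholder for really stopping it */
        return OCF_SUCCESS;
    }
    if (strcmp(action, "monitor") == 0)
        return monitor_status(h);
    return OCF_ERR_UNIMPLEMENTED;
}
```

A `main` for such an agent would fill in the health snapshot, call `agent_dispatch` with `argv[1]`, and return the result as the process exit code. Faults other than a dead process are reported through the monitor action in the same way, simply as a non-zero exit code.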

The fence agent exists to kill a CPM-9 as a last resort, when it is no longer responding and the cluster cannot otherwise guarantee that it is not accessing shared resources. The fence agent would accept certain commands from pacemaker, and then talk to the ATCA controller to power down or otherwise reset the targeted CPM-9.
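
To give an idea of the input side of a fence agent, here is a minimal sketch in C. It assumes the usual fence agent convention of receiving options as name=value lines on standard input, with `action` naming the operation and `port` naming the machine to fence; please check the fence agent API documentation for your version before relying on those names. The `fence_request` struct and the function names are hypothetical, and the part that actually talks to the ATCA controller is deliberately left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* What the agent was asked to do, collected from name=value option lines. */
typedef struct {
    char action[16];  /* e.g. "off", "on", "reboot", "monitor" */
    char target[64];  /* which blade to fence (naming is up to you) */
} fence_request;

/* Copy the value part of a line, dropping any trailing newline. */
static void copy_value(char *dst, size_t dst_size, const char *value)
{
    size_t len = strcspn(value, "\r\n");
    snprintf(dst, dst_size, "%.*s", (int)len, value);
}

/* Parse one option line; returns 1 if it set a field, 0 if it was ignored. */
int fence_parse_line(const char *line, fence_request *req)
{
    assert(line != NULL && req != NULL);
    if (strncmp(line, "action=", 7) == 0) {
        copy_value(req->action, sizeof req->action, line + 7);
        return 1;
    }
    if (strncmp(line, "port=", 5) == 0) {
        copy_value(req->target, sizeof req->target, line + 5);
        return 1;
    }
    return 0;  /* unknown options are ignored in this sketch */
}

/* Read all options from a stream (stdin when run by the cluster). */
void fence_read_options(FILE *in, fence_request *req)
{
    char line[256];
    assert(in != NULL && req != NULL);
    memset(req, 0, sizeof *req);
    while (fgets(line, sizeof line, in) != NULL)
        fence_parse_line(line, req);
}
```

After `fence_read_options(stdin, &req)`, the agent would act on `req.action` for `req.target` through whatever interface the shelf manager offers, and report success or failure through its exit code.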

Maintaining identical user data between the active and standby nodes is straightforward and a common cluster design. A common setup is using DRBD to mirror the data (and a clustered file system on top of that if both nodes need to access the data at the same time). "Clusters From Scratch" has a good example of this. Other setups are possible, e.g. using network storage such as a SAN. All of this is done via pacemaker configuration.

Your downtime requirements will be much trickier. Pacemaker detects outages by periodically polling the resource agent for status. So you have the polling interval (at least 1 second), then a timeout for receiving the status (dependent on your application and resource agent), then the startup time for the application on the other node (dependent on your application).

Depending on your application, you may be able to run it as a "clone", so that it is always running on both nodes, and then use a floating IP address to talk to it. Moving an IP address is quick, so that would eliminate the need to wait for application startup. However, you would still have the polling interval and status timeout.

-- Ken Gaillot <kgaillot at redhat.com>

_______________________________________________
Users mailing list: Users at clusterlabs.org http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org