[ClusterLabs] Information regarding pacemaker integration with user application

Tue Mar 17 09:37:38 EDT 2015

On 03/16/2015 12:56 AM, Ramya Ramadurai wrote:
> We are evaluating pacemaker HA on Linux for integration with a user application being developed for a telecom system.
> At a high level, the requirements of this application are:
> 1) The application is to run on an ATCA chassis 46XX, a 6-slot chassis consisting of 2 SCMs (shelf management cards) and 4 CPM-9 blades.
> The application will run on the CPM-9 blades.
> 2) To support HA for this application, we plan to make one of the CPM-9 cards a standby card that takes over once the active card becomes faulty. The active unit can become faulty either when the application goes down or because of ATCA hardware or other ATCA-related faults.
> 3) After switchover to the standby unit, the application should function as if there were no downtime (probably on the order of milliseconds), and it should retain all its real-time data.
> 
> We have gone through the "Clusters from Scratch" document and basic information on how pacemaker works, but we did not find any example where pacemaker exposes an API
> for a user application to integrate with.
> From the doxygen documentation, too, we are unable to work out how the functions are called, what their purpose is, etc.
> Can you please help us get more in-depth details on pacemaker internals? It would also be a great help if you could suggest how to integrate the above application with pacemaker, and which components need to be built around pacemaker to interact with ATCA.

Hi Ramya,

I don't think you need the pacemaker API; configuration is likely all
you need. At most, you may need to write custom resource agent(s) and/or
fence agent(s) if ones are not already available for your needs.

Your active and standby CPM-9s would be cluster nodes in the pacemaker
configuration. You might want to consider adding a third card to the
cluster to support quorum (the quorum node doesn't have to be allowed to
run the application, and it doesn't have to be exclusively dedicated to
the cluster -- it's just a tiebreaker vote if communication fails
between the other nodes).
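
To illustrate why a third vote helps, here is a minimal sketch of the
majority rule, assuming one vote per node. The function name is made up
for illustration; it is not a pacemaker API.

```python
def has_quorum(votes_seen: int, total_votes: int) -> bool:
    """Return True if a partition holds a strict majority of all votes."""
    return votes_seen > total_votes / 2


# Two-node cluster: if the link between the nodes fails, each side sees
# only its own vote, so neither side holds a majority.
print(has_quorum(1, 2))  # False

# Three votes: the side that can still reach the tiebreaker holds 2 of 3,
# while the isolated node holds only 1 of 3.
print(has_quorum(2, 3))  # True
print(has_quorum(1, 3))  # False
```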

Resource agents and fence agents are simple programs (usually shell
scripts) that take commands from pacemaker via the command line and
return status via the exit code.

So whatever your user application is, it would have a resource agent.
The agent accepts commands such as start and stop, and returns
particular exit codes for success, failure, etc.
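
As a rough sketch of that command/exit-code contract (written in Python
here rather than shell), the following handles start, stop and monitor
using the standard OCF exit codes. The marker file is a hypothetical
stand-in for your real application, and a real agent would also need at
least a meta-data action, which is omitted.

```python
import os
import tempfile

# Exit codes from the OCF resource agent convention.
OCF_SUCCESS = 0
OCF_ERR_UNIMPLEMENTED = 3
OCF_NOT_RUNNING = 7

# Hypothetical marker file standing in for "the application is running".
MARKER = os.path.join(tempfile.gettempdir(), "demo-telecom-app.running")


def run_action(action: str, marker: str = MARKER) -> int:
    """Carry out one action and return the exit code to report."""
    running = os.path.exists(marker)
    if action == "start":
        # Starting an already running resource counts as success.
        if not running:
            open(marker, "w").close()
        return OCF_SUCCESS
    if action == "stop":
        # Stopping an already stopped resource counts as success too.
        if running:
            os.remove(marker)
        return OCF_SUCCESS
    if action == "monitor":
        return OCF_SUCCESS if running else OCF_NOT_RUNNING
    return OCF_ERR_UNIMPLEMENTED


def main(argv: list) -> int:
    """The action arrives as the first command-line argument."""
    action = argv[1] if len(argv) > 1 else ""
    return run_action(action)
```

An actual script would finish with sys.exit(main(sys.argv)) so that the
return value becomes the process exit code that pacemaker sees.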

The fence agent is a last resort: it kills a CPM-9 that is no longer
responding when the cluster cannot otherwise guarantee that the card is
not accessing shared resources. The fence agent would accept certain
commands from pacemaker, and then talk to the ATCA controller to power
down or otherwise reset the targeted CPM-9.
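
A hedged sketch of what such a fence agent could look like follows. It
assumes the name=value-on-stdin convention used by existing fence
agents, with "action" and "port" as the option names. FakeShelfManager
is entirely hypothetical; it stands in for whatever interface your ATCA
shelf manager really offers.

```python
import io


def parse_options(stream) -> dict:
    """Parse name=value lines, skipping blank lines and '#' comments."""
    options = {}
    for line in stream:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        options[name.strip()] = value.strip()
    return options


class FakeShelfManager:
    """Hypothetical stand-in for the ATCA shelf manager interface."""

    def __init__(self):
        self.powered = {}

    def power_off(self, slot):
        self.powered[slot] = False

    def power_on(self, slot):
        self.powered[slot] = True

    def is_on(self, slot):
        return self.powered.get(slot, True)


def fence(options: dict, shelf) -> int:
    """Run the requested action; return 0 on success, 1 on failure."""
    action = options.get("action", "reboot")
    slot = options.get("port")
    if slot is None:
        return 1
    if action in ("off", "reboot"):
        shelf.power_off(slot)
        if shelf.is_on(slot):
            return 1  # could not confirm the card is down
        if action == "off":
            return 0
    if action in ("on", "reboot"):
        shelf.power_on(slot)
        return 0 if shelf.is_on(slot) else 1
    return 1  # action not handled in this sketch


shelf = FakeShelfManager()
request = io.StringIO("action=off\nport=slot3\n")
print(fence(parse_options(request), shelf))  # 0
print(shelf.is_on("slot3"))                  # False
```

The important design point is that the agent reports success only after
confirming the target is really off, since the cluster relies on that
result before recovering resources elsewhere.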

Maintaining identical user data between the active and standby nodes is
straightforward and a common cluster design. A common setup is using
DRBD to mirror the data (and a clustered file system on top of that if
both nodes need to access the data at the same time). "Clusters From
Scratch" has a good example of this. Other setups are possible, e.g.
using network storage such as a SAN. All of this is done via pacemaker
configuration.

Your downtime requirements will be much trickier. Pacemaker detects
outages by periodically polling the resource agent for status. So you
have the polling interval (at least 1 second), then a timeout for
receiving the status (dependent on your application and resource agent),
then the startup time for the application on the other node (dependent
on your application).
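
To put rough numbers on that, the worst case is simply the sum of the
three delays. The sketch below uses made-up values for the timeout and
startup time, and ignores anything else that may add delay (for
example, fencing).

```python
def worst_case_outage(monitor_interval_s: float,
                      monitor_timeout_s: float,
                      startup_s: float) -> float:
    """Worst case: the failure happens right after a successful poll,
    the next poll runs into its full timeout, and only then does the
    application start on the other node."""
    return monitor_interval_s + monitor_timeout_s + startup_s


# Hypothetical values: 1 s polling, 2 s status timeout, 5 s startup.
print(worst_case_outage(1.0, 2.0, 5.0))  # 8.0

# Even with an instant status check and instant startup, the polling
# interval alone is already far above a millisecond-range target.
print(worst_case_outage(1.0, 0.0, 0.0))  # 1.0
```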

Depending on your application, you may be able to run it as a "clone",
so that it is always running on both nodes, and then use a floating IP
address to talk to it. Moving an IP address is quick, so that would
eliminate the need to wait for application startup. However, you still
have the polling interval and status timeout.

-- Ken Gaillot <kgaillot at redhat.com>