[Pacemaker] New System Health feature

Mark Hamzy hamzy at us.ibm.com
Thu Apr 23 11:49:43 EDT 2009


I am working on a feature to add system health metrics to HA.  With this
information, HA could fail resources over, away from hardware that might
have problems.  The initial proposal was briefly discussed on the linux-HA
mailing list, but the discussion has moved to the pacemaker mailing list.

The following is a short description of what we want this new feature to do.

Feature Name:     Health monitoring support
Purpose:    Allow pacemaker to schedule resources in a way that's sensitive
to a variety of server-related health metrics

Add support in pacemaker for a class of attributes which would be specially
treated.  Under this proposal, all attributes defined for a node whose name
matches the regular expression /^#health-.*$/ would be automatically added
into the score for each resource being considered for scheduling on that node.
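
To make the scoring rule above concrete, here is a rough sketch in Python.
It is not pacemaker code.  The names INFINITY, add_scores, parse_score and
health_adjusted_score are made up for illustration, and the assumption that
-INFINITY overrides every other value is mine, not part of the proposal.

```python
import re

# Assumption: a large sentinel stands in for INFINITY in this sketch.
INFINITY = 1_000_000

# The pattern from the proposal: any node attribute named "#health-<something>".
HEALTH_ATTR = re.compile(r"^#health-.*$")


def parse_score(value):
    """Turn an attribute value such as "1000" or "-INFINITY" into an int."""
    text = str(value).strip()
    if text in ("INFINITY", "+INFINITY"):
        return INFINITY
    if text == "-INFINITY":
        return -INFINITY
    return int(text)


def add_scores(a, b):
    """Add two scores; -INFINITY wins over everything, then +INFINITY."""
    if a <= -INFINITY or b <= -INFINITY:
        return -INFINITY
    if a >= INFINITY or b >= INFINITY:
        return INFINITY
    # Clamp ordinary sums so they never pass the sentinels.
    return max(-INFINITY, min(INFINITY, a + b))


def health_adjusted_score(base_score, node_attributes):
    """Add every #health-* attribute of a node into a resource's score."""
    score = base_score
    for name, value in node_attributes.items():
        if HEALTH_ATTR.match(name):
            score = add_scores(score, parse_score(value))
    return score
```

With this sketch, a node carrying #health-ipmi=1000 and #health-smarttools=0
raises a base score of 100 to 1100, while a single #health-* attribute set
to -INFINITY keeps the resource off that node regardless of the others.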

The purpose of this is to allow multiple independent health monitors to
each set their own health status and have that taken into account when
scheduling resources.  For example, IBM might define one called
#health-ibmserver.  Someone using smarttools (disk health monitors) might
define one called #health-smarttools.  Someone else using IPMI might define
one called #health-ipmi.   This means that this feature is not specific to
any vendor, and various health monitor providers can develop health metrics
for their hardware and not have to coordinate with each other in their
development process.

Typical usage of these variables is expected to be something like this:

      Health   Attribute value   Meaning
      green    1000              server is happy, capable of running any
                                 resource
      yellow   0                 server is marginal - it is desirable to
                                 schedule resources somewhere else if you can
      red      -INFINITY         server is unreliable (but still up) and
                                 should not be used

Note that all of the values given would be configuration-specific.  These
attributes would be set via attrd_updater.
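
As a rough illustration of how a health monitoring agent might publish its
status, the Python sketch below maps a color to the example values from the
table and builds an attrd_updater command line.  The helper name
attrd_updater_argv and the HEALTH_VALUES mapping are hypothetical, and the
--name/--update options are my assumption about the attrd_updater
command line; please check them against your installed version.

```python
# Example color-to-value mapping taken from the table above.
# These values are configuration-specific, not fixed by the proposal.
HEALTH_VALUES = {"green": "1000", "yellow": "0", "red": "-INFINITY"}


def attrd_updater_argv(monitor, color, values=HEALTH_VALUES):
    """Build (but do not run) a command that sets #health-<monitor>.

    Assumption: attrd_updater accepts --name and --update; verify locally.
    """
    if color not in values:
        raise ValueError(f"unknown health color: {color!r}")
    name = f"#health-{monitor}"
    return ["attrd_updater", "--name", name, "--update", values[color]]
```

An agent could hand the returned list to subprocess.run() on the node it is
monitoring; building the argument list separately keeps the mapping easy to
test without a running cluster.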

Should the translation of health scores (colors) into specific values be
done outside the core system?

There should be an API for health monitoring agents.

This would be similar to the cluster-wide default set by symmetric-cluster:
true (0) or false (-INFINITY).

Special Note:
IBM is already in the process of developing such a health monitoring tool
for IBM X (Intel-class) servers.

So, what do you all think of this proposed functionality?  Does it sound
reasonable?  Comments are appreciated.

