[Pacemaker] Pacemaker delays (long posting)

Michael Powell Michael.Powell at harmonicinc.com
Tue Mar 5 10:01:30 EST 2013

I have recently assumed responsibility for maintaining code in one of my company's products that uses Pacemaker/Heartbeat.  I'm still coming up to speed on this code, and would like to solicit comments about some particular behavior.  For reference, Pacemaker is version 1.0.9 (per the crm_mon output below), and Heartbeat is version 3.0.3.

This product uses two host systems, each of which supports several disk enclosures, operating in an "active/passive" mode.  The two hosts are connected by redundant, dedicated 10Gb Ethernet links, which are used for messaging between them.  The disks in each enclosure are controlled by an instance of an application called SS.  If an "active" host's SS application fails for some reason, then the corresponding application on the "passive" host will assume control of the disks.  Each application is assigned a Pacemaker resource, and the resource agent communicates with the appropriate SS instance.  For reference, here's a sample crm_mon output:

Last updated: Tue Mar  5 06:10:22 2013
Stack: Heartbeat
Current DC: mgraid-12241530rn01433-0 (f4e5e15c-d06b-4e37-89b9-4621af05128f) - partition with quorum
Version: 1.0.9-89bd754939df5150de7cd76835f98fe90851b677
2 Nodes configured, unknown expected votes
9 Resources configured.

Online: [ mgraid-12241530rn01433-0 mgraid-12241530rn01433-1 ]

Clone Set: Fencing
     Started: [ mgraid-12241530rn01433-0 mgraid-12241530rn01433-1 ]
Clone Set: cloneIcms
     Started: [ mgraid-12241530rn01433-0 mgraid-12241530rn01433-1 ]
Clone Set: cloneOmserver
     Started: [ mgraid-12241530rn01433-0 mgraid-12241530rn01433-1 ]
Master/Slave Set: ms-SS11451532RN01389
     Masters: [ mgraid-12241530rn01433-1 ]
     Slaves: [ mgraid-12241530rn01433-0 ]
Master/Slave Set: ms-SS11481532RN01465
     Masters: [ mgraid-12241530rn01433-0 ]
     Slaves: [ mgraid-12241530rn01433-1 ]
Master/Slave Set: ms-SS12171532RN01613
     Masters: [ mgraid-12241530rn01433-0 ]
     Slaves: [ mgraid-12241530rn01433-1 ]
Master/Slave Set: ms-SS12241530RN01433
     Masters: [ mgraid-12241530rn01433-0 ]
     Slaves: [ mgraid-12241530rn01433-1 ]
Master/Slave Set: ms-SS12391532RN01768
     Masters: [ mgraid-12241530rn01433-0 ]
     Slaves: [ mgraid-12241530rn01433-1 ]
Master/Slave Set: ms-SS12391532RN01772
     Masters: [ mgraid-12241530rn01433-0 ]
     Slaves: [ mgraid-12241530rn01433-1 ]
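For context, one of these master/slave sets might be defined along the following lines in the crm shell.  This is a sketch only: the agent class/name and the slave monitor interval are assumptions, and the 3-second master monitor is inferred from the 3000 ms monitor operation that shows up in the logs.

```
# Hypothetical crm-shell definition for one SS master/slave set
# (agent class/name and slave interval are assumptions)
primitive SS12241530RN01433 ocf:custom:ss \
    op monitor interval="3s" role="Master" \
    op monitor interval="10s" role="Slave"
ms ms-SS12241530RN01433 SS12241530RN01433 \
    meta master-max="1" clone-max="2" notify="true"
```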

I've been investigating the system's behavior when one or more master SS instances crash, simulated with a kill command.  I've noticed two behaviors of interest.

First, in the simple case, where one master SS is killed, it takes about 10-12 seconds for the slave to complete the failover.  According to the log files, the DC issues the following notifications to the slave SS:

*         pre_notify_demote

*         post_notify_demote

*         pre_notify_stop

*         post_notify_stop

*         pre_notify_promote

*         promote

*         post_notify_promote

*         monitor_3000

*         pre_notify_start

*         post_notify_start
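These notifications are delivered to the resource agent's notify action.  In outline, that entry point looks something like the sketch below (the real SS agent isn't shown here, so the function name is a stand-in); for clones with notify="true", Pacemaker passes the notification phase and operation in environment variables:

```shell
# Sketch of a notify entry point in an OCF resource agent (ss_notify is
# a hypothetical name).  Pacemaker supplies the phase (pre/post) and the
# operation (start/stop/promote/demote) via OCF_RESKEY_CRM_meta_* vars.
ss_notify() {
    echo "notify: ${OCF_RESKEY_CRM_meta_notify_type}-${OCF_RESKEY_CRM_meta_notify_operation}"
    return 0    # OCF_SUCCESS
}

# Demonstration outside the cluster, mimicking the pre_notify_promote step:
msg=$(OCF_RESKEY_CRM_meta_notify_type=pre \
      OCF_RESKEY_CRM_meta_notify_operation=promote ss_notify)
echo "$msg"
```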

These notifications and their confirmations appear to take about 1-2 seconds each, which raises the following questions:

*         Is this sequence of notifications expected?

*         Is the 10-12 second timeframe expected?
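To put numbers on those per-notification gaps, one can diff the timestamps of consecutive log lines.  A minimal sketch follows; the two-column log format is invented for illustration (real ha-log lines carry more fields):

```shell
# Rough per-step latency from log timestamps (sample lines are invented;
# adjust the awk field indices to match the actual ha-log format).
intervals=$(awk '{
    split($1, t, ":")
    secs = t[1] * 3600 + t[2] * 60 + t[3]
    if (NR > 1) printf "%s +%ds\n", $2, secs - prev
    prev = secs
}' <<'EOF'
06:10:01 pre_notify_demote
06:10:02 post_notify_demote
06:10:04 pre_notify_stop
EOF
)
echo "$intervals"
```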

Second, in a more complex case, the master SS for every instance is assigned to the same host, and each SS is killed in turn with roughly a 10-second delay between kill commands.  In this scenario there appear to be very long delays in processing the notifications.  These delays appear to be associated with the following factors:

*         After an SS instance is killed, the next 10-second monitor operation detects the failure, which causes a new SS instance to be launched to replace the one that died.

*         It takes about 30 seconds for an SS instance to complete the startup process.  The resource agent waits for that startup to complete before returning to crmd.

*         Until the resource agent returns, crmd does not process notifications for any other SS resource.

The net effect of these delays varies from one SS instance to another.  In some cases, the "normal" failover occurs, taking 10-12 seconds.  In other cases, there is no failover to the other host's SS instance, and there is no master/active SS instance for 1-2 minutes (until an SS instance is re-launched following the kill), depending upon the number of disk enclosures and thus the number of SS instances.

My first question in this case is simply whether this serialization of notifications among the various SS resources is expected; in other words, transition notifications for one resource are delayed until earlier notifications have completed.  Is this the expected behavior?  Secondly, once the killed SS instance has been restarted, there is apparently no attempt to complete the failover: the new SS instance simply resumes the active/master role.
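In outline, the blocking wait described above looks roughly like the sketch below (names simplified; start_ss_daemon and ss_is_ready are hypothetical stand-ins for the agent's internals, with stubs here so the sketch runs standalone).  Because the agent only returns once startup finishes, which takes about 30 seconds, lrmd/crmd cannot act on the other SS resources in the meantime:

```shell
# Sketch of a start action that blocks until the daemon is ready.
start_ss_daemon() { :; }          # stub: launch the SS process
ss_is_ready()     { return 0; }   # stub: poll SS for readiness

ss_start() {
    start_ss_daemon
    timeout=30
    while [ "$timeout" -gt 0 ]; do
        if ss_is_ready; then
            return 0              # OCF_SUCCESS once startup completes
        fi
        sleep 1
        timeout=$((timeout - 1))
    done
    return 1                      # OCF_ERR_GENERIC on timeout
}

ss_start; rc=$?
echo "rc=$rc"
```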

Finally, a couple of general questions:

*         Is there any reason to believe that a later version of Pacemaker would behave differently?

*         Is there a mechanism by which the crmd (and lrmd) debug levels can be increased at run time (allowing more debug messages in the log output)?
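On that last point, I'm aware of the signal-based convention below, but I haven't confirmed that it applies to this version, so I'd appreciate confirmation:

```shell
# Reported convention for adjusting daemon verbosity at run time
# (not verified on this Pacemaker/Heartbeat version):
killall -USR1 crmd lrmd    # raise the debug level by one
killall -USR2 crmd lrmd    # lower it again
```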

Thanks very much for your help,
Michael Powell


    Michael Powell
    Staff Engineer

    15220 NW Greenbrier Pkwy
        Suite 290
    Beaverton, OR   97006
    T 503-372-7327    M 503-789-3019   H 503-625-5332

