[Pacemaker] compression with heartbeat doesn't seem to work

Fri Aug 19 09:02:24 EDT 2011

Hi,
  We are running a two-node cluster using pacemaker 1.1.5-18.1 with heartbeat 3.0.4-41.1.  We are seeing what look like network issues and cannot get heartbeat to recover: it reports "Message too long" errors and the nodes can no longer sync.

Our ha.cf is as follows:
autojoin none
use_logd false
logfacility daemon
debug 0

# use the v2 cluster resource manager
crm yes

# the cluster communication happens via unicast on bond0 and hb1
# hb1 is direct connect
ucast hb1 169.254.1.3
ucast hb1 169.254.1.4
ucast bond0 172.28.102.21
ucast bond0 172.28.102.51
compression zlib
compression_threshold 30

# msgfmt
msgfmt netstring

# a node will be flagged as dead if there is no response for 30 seconds
deadtime 30
initdead 30
keepalive 250ms
uuidfrom nodename

# these are the node names participating in the cluster
# the names should match "uname -n" output on the system
node usrv-qpr2
node usrv-qpr5

We can ping all interfaces from both nodes.  One of the bonded NICs had some trouble, but we believe we have enough redundancy built in that it should be fine.
The issue we see is that if we reboot the non-DC node, it can no longer sync with the DC.  The log on the non-DC node shows that the remote node cannot be reached.  crm_mon on the non-DC node shows:

Last updated: Fri Aug 19 07:39:05 2011
Stack: Heartbeat
Current DC: NONE
2 Nodes configured, 2 expected votes
26 Resources configured.
============

Node usrv-qpr2 (87df4a75-fa67-c05e-1a07-641fa79784e0): UNCLEAN (offline)
Node usrv-qpr5 (7fb57f74-fae5-d493-e2c7-e4eda2430217): UNCLEAN (offline)

From the DC it looks like all is well.

I tried a cibadmin -Q from the non-DC node and it can no longer contact the remote node.

I tried a cibadmin -S from the non-DC node to force a sync, which times out with "Call cib_sync failed (-41): Remote node did not respond".

On the DC side I see this:
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23249]: ERROR: write_child: write failure on ucast bond0.: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23251]: ERROR: glib: ucast_write: Unable to send HBcomm packet bond0 172.28.102.51:694 len=83696 [-1]: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23251]: ERROR: write_child: write failure on ucast bond0.: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23253]: ERROR: glib: ucast_write: Unable to send HBcomm packet hb1 169.254.1.3:694 len=83696 [-1]: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23253]: ERROR: write_child: write failure on ucast hb1.: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23255]: ERROR: glib: ucast_write: Unable to send HBcomm packet hb1 169.254.1.4:694 len=83696 [-1]: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23255]: ERROR: write_child: write failure on ucast hb1.: Message too long
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Message hist queue is filling up (500 messages in queue)
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Message hist queue is filling up (500 messages in queue)
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Cannot rexmit pkt 244442 for usrv-qpr5: seqno too low
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: fromnode =usrv-qpr5, fromnode's ackseq = 244435
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: hist information:
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: hiseq =244943, lowseq=244443,ackseq=244435,lastmsg=442
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Cannot rexmit pkt 244442 for usrv-qpr5: seqno too low
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: fromnode =usrv-qpr5, fromnode's ackseq = 244435
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: hist information:
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: hiseq =244943, lowseq=244443,ackseq=244435,lastmsg=442
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Message hist queue is filling up (500 messages in queue)
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Message hist queue is filling up (500 messages in queue)
Aug 19 07:38:22 usrv-qpr2 heartbeat: [23222]: info: all clients are now resumed

My questions:

1) It seems like the compression is not working.  Is there something we need to do to enable it?  We have tried both bz2 and zlib, and we have experimented with the compression threshold as well.

2) How do we get the non-DC system back online?  Rebooting does not work, since the DC can't seem to send the diffs to sync it.

3) If the diff it is trying to send is truly too long, how do I recover from that?

4) Would more information be useful in diagnosing the problem?
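Regarding questions 1 and 3, one rough way to judge whether compression could bring a message under the datagram limit is to compress a representative payload offline and compare sizes. The sketch below uses Python's standard zlib and bz2 modules; it does not exercise heartbeat's own compression plugins, and the sample payload is a hypothetical stand-in for a CIB-like XML message, not data from our cluster.

```python
import bz2
import zlib

MAX_UDP_PAYLOAD = 65507  # IPv4 UDP payload limit in bytes


def compressed_sizes(payload: bytes) -> dict:
    """Return the raw size and the zlib/bz2 compressed sizes of a payload."""
    return {
        "raw": len(payload),
        "zlib": len(zlib.compress(payload)),
        "bz2": len(bz2.compress(payload)),
    }


def fits(size: int) -> bool:
    """True if a message of `size` bytes fits in one UDP datagram."""
    return size <= MAX_UDP_PAYLOAD


if __name__ == "__main__":
    # Hypothetical, highly repetitive XML used only as a stand-in.
    sample = b'<nvpair id="example" name="target-role" value="Started"/>\n' * 1500
    for name, size in compressed_sizes(sample).items():
        print(name, size, "fits" if fits(size) else "too long")
```

Real CIB content will compress less well than this repetitive sample, so the numbers are only indicative. The point is that a payload of this kind should shrink well below the limit if compression is actually applied, whereas the logged len=83696 suggests the message was still oversized when it was written.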

Thanks in advance.
Diane Schaefer