[Pacemaker] compression with heartbeat doesn't seem to work

Lars Ellenberg lars.ellenberg at linbit.com
Tue Aug 23 06:16:25 EDT 2011


On Fri, Aug 19, 2011 at 08:02:24AM -0500, Schaefer, Diane E wrote:
> Hi,
>   We are running a two-node cluster using pacemaker 1.1.5-18.1 with heartbeat 3.0.4-41.1.  We are experiencing what seems like network issues and cannot make heartbeat recover.  We are experiencing "message too long" and the systems can no longer sync.
> 
> Our ha.cf is as follows:
> autojoin none
> use_logd false
> logfacility daemon
> debug 0
> 
> # use the v2 cluster resource manager
> crm yes
> 
> # the cluster communication happens via unicast on bond0 and hb1
> # hb1 is direct connect
> ucast hb1 169.254.1.3
> ucast hb1 169.254.1.4
> ucast bond0 172.28.102.21
> ucast bond0 172.28.102.51
> compression zlib
> compression_threshold 30

I suggest you try
compression bz2
compression_threshold 30
traditional_compression yes

The reason is: "traditional compression" compresses the full packet
whenever the uncompressed message size exceeds the
compression_threshold. Non-traditional compression compresses only
those message field values which are marked as to-be-compressed, and
unfortunately pacemaker does not always mark larger message fields in
this way.
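
If you want to verify whether compression actually takes effect, a
rough check (assuming tcpdump is available; interface names taken from
your ha.cf) is to watch the size of the heartbeat packets on the wire:

# watch heartbeat traffic (UDP port 694) and the reported packet lengths
tcpdump -ni bond0 udp port 694
tcpdump -ni hb1 udp port 694

With traditional compression enabled, large messages should go out
noticeably smaller than the ~83 kB your error messages complain about.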

Note that you are still limited to < 64 kB in total [*],
so if you have a huge CIB (many nodes, many resources,
especially many cloned resources), the status section of the
CIB in particular may grow too large.
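
To get an idea of how close you are to that limit, you can check the
size of the uncompressed CIB, and of its status section, on the DC
(a quick sanity check using the standard pacemaker tools):

# size of the whole CIB, in bytes
cibadmin -Q | wc -c
# size of the status section only
cibadmin -Q -o status | wc -c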

[*]
That is the theoretical maximum payload of a single UDP datagram; the
heartbeat messaging layer does not spread message payload over
multiple datagrams, and that is unlikely to change unless someone
invests non-trivial amounts of developer time and money into
extending it.
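
For reference: an IPv4 UDP datagram can carry at most 65535 bytes
including headers, i.e. roughly 65507 bytes of payload. The len=83696
in your error messages is simply too large for a single datagram,
which is exactly what the "Message too long" (EMSGSIZE) errors say.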

You should probably consider moving to corosync (>= 1.4.x),
which spreads messages over as many datagrams as needed,
up to a maximum message size of 1 MByte, IIRC.

Note that I avoid the term "fragment" here, because each datagram itself
typically will be fragmented into pieces of < MTU size.
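
If you go that route, a very rough corosync.conf sketch for a two-node
unicast setup might look like the following (addresses copied from your
ha.cf; treat the rest as an assumption to be checked against the
corosync documentation for your version, not as a drop-in config):

totem {
        version: 2
        secauth: off
        # unicast UDP, analogous to your ucast directives
        transport: udpu
        interface {
                ringnumber: 0
                bindnetaddr: 172.28.102.0
                mcastport: 5405
                member {
                        memberaddr: 172.28.102.21
                }
                member {
                        memberaddr: 172.28.102.51
                }
        }
}

service {
        # load the pacemaker plugin (corosync 1.x style)
        name: pacemaker
        ver: 0
}

The direct-connect hb1 link could become a second ring via corosync's
redundant ring protocol (rrp_mode), but that is beyond this sketch.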

In any case, you obviously need a very reliable network stack: the
more fragments are needed to transmit a single message, the less
fragment loss you can tolerate. And UDP fragments may be among the
first things that get dropped on the floor if the network stack
experiences memory pressure.
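
If you suspect fragment loss on the receiving side, the IPv4
reassembly buffers are one knob to look at (the values below are made
up; tune to taste):

# current reassembly buffer limits
sysctl net.ipv4.ipfrag_high_thresh net.ipv4.ipfrag_low_thresh
# raise them, e.g. to 16 MB / 12 MB
sysctl -w net.ipv4.ipfrag_high_thresh=16777216
sysctl -w net.ipv4.ipfrag_low_thresh=12582912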

> # msgfmt
> msgfmt netstring
> 
> # a node will be flagged as dead if there is not response for 20 seconds
> deadtime 30
> initdead 30
> keepalive 250ms
> uuidfrom nodename
> 
> # these are the node names participating in the cluster
> # the names should match "uname -n" output on the system
> node usrv-qpr2
> node usrv-qpr5
> 
> We can ping all interfaces from both nodes.  One of the bonded NICs
> had some trouble, but we believe we have enough redundancy built in
> that it should be fine.  The issue we see that if we reboot the non DC
> node it can no longer sync with the DC.  The log from the non-dc node
> shows remote node cannot be reached.  Crm_mon of the non-dc node
> shows:
> 
> Last updated: Fri Aug 19 07:39:05 2011
> Stack: Heartbeat
> Current DC: NONE
> 2 Nodes configured, 2 expected votes
> 26 Resources configured.
> ============
> 
> Node usrv-qpr2 (87df4a75-fa67-c05e-1a07-641fa79784e0): UNCLEAN (offline)
> Node usrv-qpr5 (7fb57f74-fae5-d493-e2c7-e4eda2430217): UNCLEAN (offline)
> 
> From the DC it looks like all is well.
> 
> I tried a cibadmin -Q from non DC and it can no longer contact the remote node.
> 
> I tried a cibadmin -S from the non DC to force a sync which times out with Call cib_sync failed (-41): Remote node did not respond.
> 
> On the DC side I see this:
> Aug 19 07:38:20 usrv-qpr2 heartbeat: [23249]: ERROR: write_child: write failure on ucast bond0.: Message too long
> Aug 19 07:38:20 usrv-qpr2 heartbeat: [23251]: ERROR: glib: ucast_write: Unable to send HBcomm packet bond0 172.28.102.51:694 len=83696 [-1]: Message too long
> Aug 19 07:38:20 usrv-qpr2 heartbeat: [23251]: ERROR: write_child: write failure on ucast bond0.: Message too long
> Aug 19 07:38:20 usrv-qpr2 heartbeat: [23253]: ERROR: glib: ucast_write: Unable to send HBcomm packet hb1 169.254.1.3:694 len=83696 [-1]: Message too long
> Aug 19 07:38:20 usrv-qpr2 heartbeat: [23253]: ERROR: write_child: write failure on ucast hb1.: Message too long
> Aug 19 07:38:20 usrv-qpr2 heartbeat: [23255]: ERROR: glib: ucast_write: Unable to send HBcomm packet hb1 169.254.1.4:694 len=83696 [-1]: Message too long
> Aug 19 07:38:20 usrv-qpr2 heartbeat: [23255]: ERROR: write_child: write failure on ucast hb1.: Message too long
> Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Message hist queue is filling up (500 messages in queue)
> Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Message hist queue is filling up (500 messages in queue)
> Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Cannot rexmit pkt 244442 for usrv-qpr5: seqno too low
> Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: fromnode =usrv-qpr5, fromnode's ackseq = 244435
> Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: hist information:
> Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: hiseq =244943, lowseq=244443,ackseq=244435,lastmsg=442
> Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Cannot rexmit pkt 244442 for usrv-qpr5: seqno too low
> Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: fromnode =usrv-qpr5, fromnode's ackseq = 244435
> Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: hist information:
> Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: hiseq =244943, lowseq=244443,ackseq=244435,lastmsg=442
> Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Message hist queue is filling up (500 messages in queue)
> Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Message hist queue is filling up (500 messages in queue)
> Aug 19 07:38:22 usrv-qpr2 heartbeat: [23222]: info: all clients are now resumed
> 
> My questions:
> 
> 1)      Seems like the compression is not working.  Is there something
> we need to do to enable it?  We have tried both bz2 and  zlib.  We've
> played with the compression threshold as well.

See above.
Because pacemaker sometimes does not mark large message field values
as "should-be-compressed" via the heartbeat message API, you need
"traditional_compression yes" to allow heartbeat to compress the full
message instead.

> 2)      How do we get the non DC system back on-line?  Rebooting does not work since the DC can't seem to send the diffs to sync it.
> 
> 3)      If the diff it is trying to send is truly too long, how do I recover from that?

Sometimes pacemaker needs to send the full CIB.
The CIB, particularly its status section, will grow over time as it
accumulates probing, monitoring, and other action results.

If you start off with a CIB that is already too large, you are out of
luck. If you start with a CIB that fits, it may still grow too large
over time, so you may need to do some "special maintenance" there,
e.g. periodically delete "outdated" status results by hand, or
similar.

In that case, you should probably rather consider using corosync
instead, or reduce the number of your services/clones.

> 4)      Would more information be useful in diagnosing the problem?

I don't think so.


-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.



