<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:p="urn:schemas-microsoft-com:office:powerpoint" xmlns:a="urn:schemas-microsoft-com:office:access" xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns:s="uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882" xmlns:rs="urn:schemas-microsoft-com:rowset" xmlns:z="#RowsetSchema" xmlns:b="urn:schemas-microsoft-com:office:publisher" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:c="urn:schemas-microsoft-com:office:component:spreadsheet" xmlns:odc="urn:schemas-microsoft-com:office:odc" xmlns:oa="urn:schemas-microsoft-com:office:activation" xmlns:html="http://www.w3.org/TR/REC-html40" xmlns:q="http://schemas.xmlsoap.org/soap/envelope/" xmlns:rtc="http://microsoft.com/officenet/conferencing" xmlns:D="DAV:" xmlns:Repl="http://schemas.microsoft.com/repl/" xmlns:mt="http://schemas.microsoft.com/sharepoint/soap/meetings/" xmlns:x2="http://schemas.microsoft.com/office/excel/2003/xml" xmlns:ppda="http://www.passport.com/NameSpace.xsd" xmlns:ois="http://schemas.microsoft.com/sharepoint/soap/ois/" xmlns:dir="http://schemas.microsoft.com/sharepoint/soap/directory/" xmlns:ds="http://www.w3.org/2000/09/xmldsig#" xmlns:dsp="http://schemas.microsoft.com/sharepoint/dsp" xmlns:udc="http://schemas.microsoft.com/data/udc" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:sub="http://schemas.microsoft.com/sharepoint/soap/2002/1/alerts/" xmlns:ec="http://www.w3.org/2001/04/xmlenc#" xmlns:sp="http://schemas.microsoft.com/sharepoint/" xmlns:sps="http://schemas.microsoft.com/sharepoint/soap/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:udcs="http://schemas.microsoft.com/data/udc/soap" xmlns:udcxf="http://schemas.microsoft.com/data/udc/xmlfile" xmlns:udcp2p="http://schemas.microsoft.com/data/udc/parttopart" xmlns:wf="http://schemas.microsoft.com/sharepoint/soap/workflow/" 
xmlns:dsss="http://schemas.microsoft.com/office/2006/digsig-setup" xmlns:dssi="http://schemas.microsoft.com/office/2006/digsig" xmlns:mdssi="http://schemas.openxmlformats.org/package/2006/digital-signature" xmlns:mver="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns:mrels="http://schemas.openxmlformats.org/package/2006/relationships" xmlns:spwp="http://microsoft.com/sharepoint/webpartpages" xmlns:ex12t="http://schemas.microsoft.com/exchange/services/2006/types" xmlns:ex12m="http://schemas.microsoft.com/exchange/services/2006/messages" xmlns:pptsl="http://schemas.microsoft.com/sharepoint/soap/SlideLibrary/" xmlns:spsl="http://microsoft.com/webservices/SharePointPortalServer/PublishedLinksService" xmlns:Z="urn:schemas-microsoft-com:" xmlns:st="" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=us-ascii"><meta name=Generator content="Microsoft Word 12 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Tahoma;
panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri","sans-serif";}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
p.MsoAcetate, li.MsoAcetate, div.MsoAcetate
{mso-style-priority:99;
mso-style-link:"Balloon Text Char";
margin:0in;
margin-bottom:.0001pt;
font-size:8.0pt;
font-family:"Tahoma","sans-serif";}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
{mso-style-priority:34;
margin-top:0in;
margin-right:0in;
margin-bottom:0in;
margin-left:.5in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri","sans-serif";}
span.EmailStyle17
{mso-style-type:personal-compose;
font-family:"Calibri","sans-serif";
color:windowtext;}
span.BalloonTextChar
{mso-style-name:"Balloon Text Char";
mso-style-priority:99;
mso-style-link:"Balloon Text";
font-family:"Tahoma","sans-serif";}
.MsoChpDefault
{mso-style-type:export-only;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
/* List Definitions */
@list l0
{mso-list-id:103622033;
mso-list-type:hybrid;
mso-list-template-ids:-722439732 67698705 67698713 67698715 67698703 67698713 67698715 67698703 67698713 67698715;}
@list l0:level1
{mso-level-text:"%1\)";
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
ol
{margin-bottom:0in;}
ul
{margin-bottom:0in;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body lang=EN-US link=blue vlink=purple><div class=WordSection1><p class=MsoNormal>Hi,<o:p></o:p></p><p class=MsoNormal>We are running a two-node cluster using Pacemaker 1.1.5-18.1 with Heartbeat 3.0.4-41.1. We are seeing what look like network issues and cannot get Heartbeat to recover. Heartbeat reports "Message too long" errors and the nodes can no longer sync.<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>Our ha.cf is as follows:<o:p></o:p></p><p class=MsoNormal>autojoin none<o:p></o:p></p><p class=MsoNormal>use_logd false<o:p></o:p></p><p class=MsoNormal>logfacility daemon<o:p></o:p></p><p class=MsoNormal>debug 0<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal># use the v2 cluster resource manager<o:p></o:p></p><p class=MsoNormal>crm yes<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal># the cluster communication happens via unicast on bond0 and hb1<o:p></o:p></p><p class=MsoNormal># hb1 is direct connect<o:p></o:p></p><p class=MsoNormal>ucast hb1 169.254.1.3<o:p></o:p></p><p class=MsoNormal>ucast hb1 169.254.1.4<o:p></o:p></p><p class=MsoNormal>ucast bond0 172.28.102.21<o:p></o:p></p><p class=MsoNormal>ucast bond0 172.28.102.51<o:p></o:p></p><p class=MsoNormal>compression zlib<o:p></o:p></p><p class=MsoNormal>compression_threshold 30<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal># msgfmt<o:p></o:p></p><p class=MsoNormal>msgfmt netstring<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal># a node will be flagged as dead if there is not response for 20 seconds<o:p></o:p></p><p class=MsoNormal>deadtime 30<o:p></o:p></p><p class=MsoNormal>initdead 30<o:p></o:p></p><p class=MsoNormal>keepalive 250ms<o:p></o:p></p><p class=MsoNormal>uuidfrom nodename<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal># these are the node names participating in the cluster<o:p></o:p></p><p class=MsoNormal># the names should 
match "uname -n" output on the system<o:p></o:p></p><p class=MsoNormal>node usrv-qpr2<o:p></o:p></p><p class=MsoNormal>node usrv-qpr5<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>We can ping all interfaces from both nodes. One of the bonded NICs had some trouble, but we believe we have enough redundancy built in that it should be fine.<o:p></o:p></p><p class=MsoNormal>The issue we see is that if we reboot the non-DC node, it can no longer sync with the DC. The log on the non-DC node shows that the remote node cannot be reached. crm_mon on the non-DC node shows:<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>Last updated: Fri Aug 19 07:39:05 2011<o:p></o:p></p><p class=MsoNormal>Stack: Heartbeat<o:p></o:p></p><p class=MsoNormal>Current DC: NONE<o:p></o:p></p><p class=MsoNormal>2 Nodes configured, 2 expected votes<o:p></o:p></p><p class=MsoNormal>26 Resources configured.<o:p></o:p></p><p class=MsoNormal>============<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>Node usrv-qpr2 (87df4a75-fa67-c05e-1a07-641fa79784e0): UNCLEAN (offline)<o:p></o:p></p><p class=MsoNormal>Node usrv-qpr5 (7fb57f74-fae5-d493-e2c7-e4eda2430217): UNCLEAN (offline)<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>From the DC it looks like all is well.<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>I tried cibadmin -Q from the non-DC node and it can no longer contact the remote node.<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>I tried cibadmin -S from the non-DC node to force a sync, which times out with: Call cib_sync failed (-41): Remote node did not respond.<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>On the DC side I see this:<o:p></o:p></p><p class=MsoNormal>Aug 19 07:38:20 usrv-qpr2 heartbeat: [23249]: ERROR: write_child: write failure on ucast bond0.: Message too long<o:p></o:p></p><p class=MsoNormal>Aug 19 07:38:20 usrv-qpr2 heartbeat: [23251]: ERROR: glib: 
ucast_write: Unable to send HBcomm packet bond0 172.28.102.51:694 len=83696 [-1]: Message too long<o:p></o:p></p><p class=MsoNormal>Aug 19 07:38:20 usrv-qpr2 heartbeat: [23251]: ERROR: write_child: write failure on ucast bond0.: Message too long<o:p></o:p></p><p class=MsoNormal>Aug 19 07:38:20 usrv-qpr2 heartbeat: [23253]: ERROR: glib: ucast_write: Unable to send HBcomm packet hb1 169.254.1.3:694 len=83696 [-1]: Message too long<o:p></o:p></p><p class=MsoNormal>Aug 19 07:38:20 usrv-qpr2 heartbeat: [23253]: ERROR: write_child: write failure on ucast hb1.: Message too long<o:p></o:p></p><p class=MsoNormal>Aug 19 07:38:20 usrv-qpr2 heartbeat: [23255]: ERROR: glib: ucast_write: Unable to send HBcomm packet hb1 169.254.1.4:694 len=83696 [-1]: Message too long<o:p></o:p></p><p class=MsoNormal>Aug 19 07:38:20 usrv-qpr2 heartbeat: [23255]: ERROR: write_child: write failure on ucast hb1.: Message too long<o:p></o:p></p><p class=MsoNormal>Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Message hist queue is filling up (500 messages in queue)<o:p></o:p></p><p class=MsoNormal>Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Message hist queue is filling up (500 messages in queue)<o:p></o:p></p><p class=MsoNormal>Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Cannot rexmit pkt 244442 for usrv-qpr5: seqno too low<o:p></o:p></p><p class=MsoNormal>Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: fromnode =usrv-qpr5, fromnode's ackseq = 244435<o:p></o:p></p><p class=MsoNormal>Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: hist information:<o:p></o:p></p><p class=MsoNormal>Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: hiseq =244943, lowseq=244443,ackseq=244435,lastmsg=442<o:p></o:p></p><p class=MsoNormal>Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Cannot rexmit pkt 244442 for usrv-qpr5: seqno too low<o:p></o:p></p><p class=MsoNormal>Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: fromnode =usrv-qpr5, fromnode's ackseq = 
244435<o:p></o:p></p><p class=MsoNormal>Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: hist information:<o:p></o:p></p><p class=MsoNormal>Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: hiseq =244943, lowseq=244443,ackseq=244435,lastmsg=442<o:p></o:p></p><p class=MsoNormal>Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Message hist queue is filling up (500 messages in queue)<o:p></o:p></p><p class=MsoNormal>Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Message hist queue is filling up (500 messages in queue)<o:p></o:p></p><p class=MsoNormal>Aug 19 07:38:22 usrv-qpr2 heartbeat: [23222]: info: all clients are now resumed<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>My questions:<o:p></o:p></p><p class=MsoListParagraph style='text-indent:-.25in;mso-list:l0 level1 lfo1'><![if !supportLists]><span style='mso-list:Ignore'>1)<span style='font:7.0pt "Times New Roman"'> </span></span><![endif]>It seems that compression is not working. Is there something we need to do to enable it? We have tried both bz2 and zlib, and we have also varied the compression threshold.<o:p></o:p></p><p class=MsoListParagraph style='text-indent:-.25in;mso-list:l0 level1 lfo1'><![if !supportLists]><span style='mso-list:Ignore'>2)<span style='font:7.0pt "Times New Roman"'> </span></span><![endif]>How do we get the non-DC node back online? 
Rebooting does not work since the DC can't seem to send the diffs to sync it.<o:p></o:p></p><p class=MsoListParagraph style='text-indent:-.25in;mso-list:l0 level1 lfo1'><![if !supportLists]><span style='mso-list:Ignore'>3)<span style='font:7.0pt "Times New Roman"'> </span></span><![endif]>If the diff it is trying to send is truly too long, how do I recover from that?<o:p></o:p></p><p class=MsoListParagraph style='text-indent:-.25in;mso-list:l0 level1 lfo1'><![if !supportLists]><span style='mso-list:Ignore'>4)<span style='font:7.0pt "Times New Roman"'> </span></span><![endif]>Would more information be useful in diagnosing the problem?<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>Thanks in advance.<o:p></o:p></p><p class=MsoNormal>Diane Schaefer<o:p></o:p></p></div></body></html>