No subject


Sun Apr 3 02:52:37 EDT 2011


Hi,

We are running a two-node cluster using pacemaker 1.1.5-18.1 with heartbeat 3.0.4-41.1.  We are experiencing what seem to be network issues and cannot make heartbeat recover.  We are seeing "Message too long" errors, and the systems can no longer sync.

Our ha.cf is as follows:
autojoin none
use_logd false
logfacility daemon
debug 0

# use the v2 cluster resource manager
crm yes

# the cluster communication happens via unicast on bond0 and hb1
# hb1 is direct connect
ucast hb1 169.254.1.3
ucast hb1 169.254.1.4
ucast bond0 172.28.102.21
ucast bond0 172.28.102.51
compression zlib
compression_threshold 30

# msgfmt
msgfmt netstring

# a node will be flagged as dead if there is not response for 20 seconds
deadtime 30
initdead 30
keepalive 250ms
uuidfrom nodename

# these are the node names participating in the cluster
# the names should match "uname -n" output on the system
node usrv-qpr2
node usrv-qpr5

We can ping all interfaces from both nodes.  One of the bonded NICs had some trouble, but we believe we have enough redundancy built in that it should be fine.
The issue we see is that if we reboot the non-DC node, it can no longer sync with the DC.  The log on the non-DC node shows that the remote node cannot be reached.  crm_mon on the non-DC node shows:

Last updated: Fri Aug 19 07:39:05 2011
Stack: Heartbeat
Current DC: NONE
2 Nodes configured, 2 expected votes
26 Resources configured.
============

Node usrv-qpr2 (87df4a75-fa67-c05e-1a07-641fa79784e0): UNCLEAN (offline)
Node usrv-qpr5 (7fb57f74-fae5-d493-e2c7-e4eda2430217): UNCLEAN (offline)

From the DC it looks like all is well.

I tried a cibadmin -Q from the non-DC node, and it can no longer contact the remote node.

I tried a cibadmin -S from the non-DC node to force a sync, which times out with Call cib_sync failed (-41): Remote node did not respond.

On the DC side I see this:
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23249]: ERROR: write_child: write failure on ucast bond0.: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23251]: ERROR: glib: ucast_write: Unable to send HBcomm packet bond0 172.28.102.51:694 len=83696 [-1]: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23251]: ERROR: write_child: write failure on ucast bond0.: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23253]: ERROR: glib: ucast_write: Unable to send HBcomm packet hb1 169.254.1.3:694 len=83696 [-1]: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23253]: ERROR: write_child: write failure on ucast hb1.: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23255]: ERROR: glib: ucast_write: Unable to send HBcomm packet hb1 169.254.1.4:694 len=83696 [-1]: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23255]: ERROR: write_child: write failure on ucast hb1.: Message too long
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Message hist queue is filling up (500 messages in queue)
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Message hist queue is filling up (500 messages in queue)
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Cannot rexmit pkt 244442 for usrv-qpr5: seqno too low
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: fromnode =usrv-qpr5, fromnode's ackseq = 244435
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: hist information:
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: hiseq =244943, lowseq=244443,ackseq=244435,lastmsg=442
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Cannot rexmit pkt 244442 for usrv-qpr5: seqno too low
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: fromnode =usrv-qpr5, fromnode's ackseq = 244435
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: hist information:
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: hiseq =244943, lowseq=244443,ackseq=244435,lastmsg=442
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Message hist queue is filling up (500 messages in queue)
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Message hist queue is filling up (500 messages in queue)
Aug 19 07:38:22 usrv-qpr2 heartbeat: [23222]: info: all clients are now resumed

My questions:

1)      It seems like the compression is not working.  Is there something we need to do to enable it?  We have tried both bz2 and zlib.  We've played with the compression threshold as well.

2)      How do we get the non-DC system back online?  Rebooting does not work, since the DC can't seem to send the diffs to sync it.

3)      If the diff it is trying to send is truly too long, how do I recover from that?

4)      Would more information be useful in diagnosing the problem?

Thanks in advance.
Diane Schaefer
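A note on the len=83696 figure in the ucast_write errors: the ucast transport sends over UDP (port 694 in the log), and a single IPv4 UDP datagram can carry at most 65,507 bytes of payload (65,535 minus a 20-byte IP header and an 8-byte UDP header). "Message too long" is the system error text for EMSGSIZE, which is what a send above that limit returns. The sketch below is only an illustration of that arithmetic against the quoted log lines. The regular expression, the function name and the assumption of IPv4 without header options are mine, and the log does not say whether 83696 is the size before or after compression.

```python
import re

# Largest UDP payload over IPv4: the 16-bit total-length field (65535)
# minus a 20-byte IPv4 header (assumed: no options) and the 8-byte UDP header.
MAX_UDP_PAYLOAD_IPV4 = 65535 - 20 - 8

# Hypothetical pattern written to match the ucast_write lines quoted in the post.
UCAST_WRITE_RE = re.compile(
    r"ucast_write: Unable to send HBcomm packet "
    r"(?P<iface>\S+) (?P<addr>[\d.]+):(?P<port>\d+) len=(?P<len>\d+)"
)


def oversized_sends(log_lines):
    """Return (iface, addr, length, excess) for each send above the UDP limit."""
    found = []
    for line in log_lines:
        match = UCAST_WRITE_RE.search(line)
        if match is None:
            continue  # not a ucast_write error line
        length = int(match.group("len"))
        if length > MAX_UDP_PAYLOAD_IPV4:
            excess = length - MAX_UDP_PAYLOAD_IPV4
            found.append((match.group("iface"), match.group("addr"), length, excess))
    return found


# Two of the lines from the DC log in the post.
SAMPLE = [
    "Aug 19 07:38:20 usrv-qpr2 heartbeat: [23251]: ERROR: glib: ucast_write: "
    "Unable to send HBcomm packet bond0 172.28.102.51:694 len=83696 [-1]: Message too long",
    "Aug 19 07:38:20 usrv-qpr2 heartbeat: [23253]: ERROR: glib: ucast_write: "
    "Unable to send HBcomm packet hb1 169.254.1.3:694 len=83696 [-1]: Message too long",
]

if __name__ == "__main__":
    for iface, addr, length, excess in oversized_sends(SAMPLE):
        print(f"{iface} -> {addr}: {length} bytes, {excess} over the UDP limit")
```

By this arithmetic the packet is 18,189 bytes over the limit, so it would have to shrink by a little over a fifth before a single datagram could carry it.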

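The "seqno too low" lines can also be read straight off the numbers heartbeat prints. The sketch below assumes, as an interpretation of the log rather than anything taken from the heartbeat source, that the DC keeps a retransmit history covering lowseq through hiseq and that ackseq is the last packet the peer acknowledged. The class and function names are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class HistWindow:
    """Sequence numbers from a heartbeat 'hist information' log line."""
    hiseq: int
    lowseq: int
    ackseq: int

    def span(self) -> int:
        # Distance between the newest and oldest retained packet.
        return self.hiseq - self.lowseq

    def can_retransmit(self, seqno: int) -> bool:
        # Assumed rule: only packets still inside the window can be resent.
        return self.lowseq <= seqno <= self.hiseq

    def unrecoverable_after_ack(self) -> int:
        # Packets after the peer's last ack that have already left the window.
        return max(0, self.lowseq - self.ackseq - 1)


def parse_hist(line: str) -> HistWindow:
    """Parse 'hiseq =244943, lowseq=244443,ackseq=244435,lastmsg=442'."""
    fields = dict(re.findall(r"(\w+)\s*=\s*(\d+)", line))
    return HistWindow(int(fields["hiseq"]), int(fields["lowseq"]), int(fields["ackseq"]))


# The hist line from the DC log in the post.
HIST_LINE = (
    "Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: "
    "hiseq =244943, lowseq=244443,ackseq=244435,lastmsg=442"
)

if __name__ == "__main__":
    window = parse_hist(HIST_LINE)
    print("window span:", window.span())
    print("can resend 244442:", window.can_retransmit(244442))
    print("packets lost after last ack:", window.unrecoverable_after_ack())
```

Read this way, the span of 500 is consistent with the "500 messages in queue" errors, and packet 244442 sits just below lowseq, so the DC no longer holds the packets that usrv-qpr5 is asking for.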


More information about the Pacemaker mailing list