No subject

Sun Apr 3 06:52:37 UTC 2011

Hi,
  We are running a two-node cluster using pacemaker 1.1.5-18.1 with heartbeat 3.0.4-41.1.  We are experiencing what seem like network issues and cannot make heartbeat recover.  We are seeing "message too long" errors and the systems can no longer sync.

Our configuration is as follows:
autojoin none
use_logd false
logfacility daemon
debug 0

# use the v2 cluster resource manager
crm yes

# the cluster communication happens via unicast on bond0 and hb1
# hb1 is direct connect
ucast hb1
ucast hb1
ucast bond0 17=
ucast bond0
compression zlib
compression_threshold 30

# msgfmt
msgfmt netstring

# a node will be flagged as dead if there is not response for 20 seconds
deadtime 30
initdead 30
keepalive 250ms
uuidfrom nodename

# these are the node names participating in the cluster
# the names should match "uname -n" output on the system
node usrv-qpr2
node usrv-qpr5

We can ping all interfaces from both nodes.  One of the bonded NICs had some trouble, but we believe we have enough redundancy built in that it should be fine.
The issue we see is that if we reboot the non-DC node, it can no longer sync with the DC.  The log from the non-DC node shows that the remote node cannot be reached.  crm_mon on the non-DC node shows:

Last updated: Fri Aug 19 07:39:05 2011
Stack: Heartbeat
Current DC: NONE
2 Nodes configured, 2 expected votes
26 Resources configured.
============

Node usrv-qpr2 (87df4a75-fa67-c05e-1a07-641fa79784e0): UNCLEAN (offline)
Node usrv-qpr5 (7fb57f74-fae5-d493-e2c7-e4eda2430217): UNCLEAN (offline)

From the DC it looks like all is well.

I tried a cibadmin -Q from the non-DC node and it can no longer contact the remote node.

I tried a cibadmin -S from the non-DC node to force a sync, which times out with:
Call cib_sync failed (-41): Remote node did not respond.

On the DC side I see this:
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23249]: ERROR: write_child: write failure on ucast bond0.: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23251]: ERROR: glib: ucast_write: Unable to send HBcomm packet bond0 len=83696 [-1]: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23251]: ERROR: write_child: write failure on ucast bond0.: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23253]: ERROR: glib: ucast_write: Unable to send HBcomm packet hb1 len=83696 [-1]: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23253]: ERROR: write_child: write failure on ucast hb1.: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23255]: ERROR: glib: ucast_write: Unable to send HBcomm packet hb1 len=83696 [-1]: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23255]: ERROR: write_child: write failure on ucast hb1.: Message too long
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Message hist queue is filling up (500 messages in queue)
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Message hist queue is filling up (500 messages in queue)
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Cannot rexmit pkt 244442 for usrv-qpr5: seqno too low
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: fromnode =usrv-qpr5, fromnode's ackseq = 244435
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: hist information:
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: hiseq =244943, lowseq=244443,ackseq=244435,lastmsg=442
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Cannot rexmit pkt 244442 for usrv-qpr5: seqno too low
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: fromnode =usrv-qpr5, fromnode's ackseq = 244435
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: hist information:
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: hiseq =244943, lowseq=244443,ackseq=244435,lastmsg=442
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Message hist queue is filling up (500 messages in queue)
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Message hist queue is filling up (500 messages in queue)
Aug 19 07:38:22 usrv-qpr2 heartbeat: [23222]: info: all clients are now resumed

My questions:

1)      It seems like the compression is not working.  Is there something we need to do to enable it?  We have tried both bz2 and zlib.  We've played with the compression threshold as well.

2)      How do we get the non-DC system back online?  Rebooting does not work, since the DC can't seem to send the diffs to sync it.

3)      If the diff it is trying to send is truly too long, how do I recover from that?

4)      Would more information be useful in diagnosing the problem?

Thanks in advance.
Diane Schaefer
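
P.S. For anyone reading the "Message too long" lines above, here is a rough size check. It is only a sketch and rests on an assumption not confirmed anywhere in this message: that each ucast message goes out as a single UDP datagram over IPv4. The constant and function names are made up for illustration and do not come from heartbeat.

```python
# Assumption (not confirmed in the message): one heartbeat ucast message
# is sent as one UDP datagram over IPv4.

IPV4_MAX_PACKET = 65535  # largest value of the 16-bit IPv4 total-length field
IPV4_MIN_HEADER = 20     # IPv4 header without options
UDP_HEADER = 8           # fixed UDP header size

# Largest payload a single IPv4 UDP datagram can carry.
MAX_UDP_PAYLOAD = IPV4_MAX_PACKET - IPV4_MIN_HEADER - UDP_HEADER


def fits_in_one_datagram(length: int) -> bool:
    """Return True if a payload of `length` bytes fits in one UDP datagram."""
    return 0 <= length <= MAX_UDP_PAYLOAD


logged_len = 83696  # the "len=83696" value from the ucast_write errors

print(MAX_UDP_PAYLOAD)                   # 65507
print(fits_in_one_datagram(logged_len))  # False
print(logged_len - MAX_UDP_PAYLOAD)      # 18189 bytes over the limit
```

If that assumption holds, the logged length is larger than any single datagram could carry, which would be consistent with the errors whether or not the NICs are healthy.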

