<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 TRANSITIONAL//EN">

<HTML>

<HEAD>

  <META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=UTF-8">

  <META NAME="GENERATOR" CONTENT="GtkHTML/3.16.3">

</HEAD>

<BODY>

Hi Lars,<BR>

<BR>

1) Wireshark ... a really nice tool. Wireshark and I are already well on our way to becoming close friends as I try to debug this situation.<BR>

<BR>

2) This is a pure test environment, and I have done everything I can to keep the setup simple, so no firewall is configured on these systems. (All firewalling is handled outside of my local environment.)<BR>

<BR>

3) I've tried to manipulate "timeo" and "retrans".  These are the current test values: timeo=20,retrans=4, which work great with NFSv3 reads over TCP.<BR>
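<BR>
A rough sketch of how long the client would wait with these values before reporting a major timeout ("NFS server not responding"). It assumes the behaviour described in nfs(5): "timeo" is given in tenths of a second, and over TCP the timeout grows linearly by "timeo" after each retransmission, up to a maximum of 600 seconds. The function name, and the assumption that the initial send plus each of the "retrans" retries waits one interval, are illustrative only; the actual 2.6.27 client may behave differently.<BR>

<PRE>
```python
def tcp_retry_schedule(timeo_deciseconds, retrans, cap_seconds=600.0):
    """Return the assumed wait (in seconds) after each transmission.

    Assumption: linear backoff as documented in nfs(5) for NFS over TCP.
    The initial send waits timeo, the first retry waits 2 * timeo, and so on,
    with each wait capped at cap_seconds.
    """
    base = timeo_deciseconds / 10.0  # timeo is expressed in tenths of a second
    # One wait for the initial transmission plus one per retransmission.
    return [min(base * (attempt + 1), cap_seconds) for attempt in range(retrans + 1)]


if __name__ == "__main__":
    waits = tcp_retry_schedule(20, 4)  # the test values timeo=20,retrans=4
    print("waits per attempt (s):", waits)
    print("time until major timeout (s):", sum(waits))
```
</PRE>

Under these assumptions, timeo=20,retrans=4 gives waits of 2, 4, 6, 8 and 10 seconds, or about 30 seconds before a major timeout is reported.<BR>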

<BR>

4) This is the SLES11 HAE GA release. The kernel is 2.6.27.19-5-default.<BR>

<BR>

5) 

<BLOCKQUOTE TYPE=CITE>

<PRE>

<FONT COLOR="#000000">analysing the network dump during a switchover/failover should be enough</FONT> <FONT COLOR="#000000">to trouble shoot your issue.</FONT>

</PRE>

</BLOCKQUOTE>

<BR>

So I thought, too. But the best that I've done is to become suspicious of retries after a migration with streamed writes. Retries are a bit of a "duh" ... as in the obvious culprit, and my manipulations of "timeo" and "retrans" have not solved the issue.<BR>

<BR>

Does anyone have any idea why NFSv3 reads over TCP succeed across hundreds of migrations and failovers, but writes fail?<BR>

<BR>

Thanks,<BR>

Bob Haxo<BR>

SGI<BR>

<BR>

<BR>

<BR>

On Wed, 2009-05-20 at 18:39 +0200, Lars Ellenberg wrote:

<BLOCKQUOTE TYPE=CITE>

<PRE>

<FONT COLOR="#000000">On Tue, May 19, 2009 at 03:15:17PM -0700, Bob Haxo wrote:</FONT>

<FONT COLOR="#000000">> Greetings,</FONT>

<FONT COLOR="#000000">> </FONT>

<FONT COLOR="#000000">> I find that streamed writes fail with migration for NFS v3 over TCP.</FONT>

<FONT COLOR="#000000">> Not every time, but almost every time.</FONT>

<FONT COLOR="#000000">> </FONT>

<FONT COLOR="#000000">> Streamed writes continue nicely across many migrations for NFS v3 over</FONT>

<FONT COLOR="#000000">> UDP.</FONT>

<FONT COLOR="#000000">> </FONT>

<FONT COLOR="#000000">> With TCP, writes continue with migration back to the initial server.</FONT>

<FONT COLOR="#000000">> </FONT>

<FONT COLOR="#000000">> Does anyone have HA NFS migrations working for NFS over TCP?</FONT>

<FONT COLOR="#000000">> </FONT>

<FONT COLOR="#000000">> Suggestions?</FONT>


<FONT COLOR="#000000">tcpdump/tshark dump nfs traffic during a switchover.</FONT>

<FONT COLOR="#000000">analyse with wireshark.</FONT>


<FONT COLOR="#000000">suspicions:</FONT>

<FONT COLOR="#000000"> timeo= mount option does a retry of failed requests every x seconds.</FONT>

<FONT COLOR="#000000"> maybe it just needs a long time to recognize the failover?</FONT>

<FONT COLOR="#000000"> do you find "NFS server not responding" in the client logs?</FONT>


<FONT COLOR="#000000"> connection tracking firewall on "new" server may drop tcp packets</FONT>

<FONT COLOR="#000000"> that do not fit into existing connections,</FONT>

<FONT COLOR="#000000"> so on retry you may run into much longer timeouts.</FONT>

<FONT COLOR="#000000"> if you have a firewall, and you only ACCEPT "new" or "established"</FONT>

<FONT COLOR="#000000"> connections, but DROP everything else, consider to instead REJECT</FONT>

<FONT COLOR="#000000"> with tcp-reset NFS traffic from internal clients that connection</FONT>

<FONT COLOR="#000000"> tracking does not know about.</FONT>


<FONT COLOR="#000000">analysing the network dump during a switchover/failover should be enough</FONT>

<FONT COLOR="#000000">to trouble shoot your issue.</FONT>


<FONT COLOR="#000000">btw, what kernel you are on?</FONT>


</PRE>

</BLOCKQUOTE>

</BODY>

</HTML>