[ClusterLabs] Failing over NFSv4/TCP exports

Wed Aug 17 09:44:28 EDT 2016

Dear list (sorry for the rather long e-mail),

I'm looking for someone who has successfully implemented the "exportfs" RA with NFSv4 over TCP (and is willing to share some information).

The final goal is to present NFS datastores to ESXi from 2 "head" nodes. Both nodes must be active in the sense that they both have an NFS server running, but they export different file systems (via the exportfs RA and floating IPaddr2 addresses).

When moving an export to another node, we move the entire "filesystem/export/ipaddr" stack but we keep the NFS server running (as it might potentially be exporting some other file systems via other IPs).

Both nodes are sharing disks (JBOD for physical and shared VMDKs for testing). Disks are only accessed by a single "head" node at any given time so a clustered file system is not required.

To my knowledge, this setup has been best described by Florian Haas here:
https://www.suse.com/documentation/sle_ha/singlehtml/book_sleha_techguides/book_sleha_techguides.html
(except we're not using DRBD and LVM)

Before going into more detail, I should mention that I have already read the following posts and examples, as well as many of the NFS-related questions on this list over the past year or so.

http://wiki.linux-nfs.org/wiki/index.php/Nfsd4_server_recovery
http://wiki.linux-nfs.org/wiki/index.php/NFS_Recovery_and_Client_Migration
http://oss.clusterlabs.org/pipermail/pacemaker/2011-July/011000.html
https://access.redhat.com/solutions/42868

I'm forced to use TCP because of ESXi, and I would like to use NFSv4 because ESXi can use "session trunking" or some sort of "multipath" with version 4 (not tested yet).

The problem I see is one that a lot of people have already mentioned: failover works nicely but failback takes a very long time. Many posts mention putting /var/lib/nfs on a shared disk, but this only makes sense when we fail over an entire NFS server (as opposed to individual exports). Moreover, I don't see any relevant information written to /var/lib/nfs when a single Linux NFSv4 client is mounting a folder.

The NFSv4 lease and grace times have both been reduced to 10 seconds. I'm using the exportfs RA parameter "wait_for_leasetime_on_stop=true".
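
For reference, a Linux kernel NFS server normally exposes these two timers under /proc/fs/nfsd. The following is a minimal Python sketch for checking the values actually in effect on each node. It assumes the nfsd pseudo-filesystem is mounted at its usual location; the helper name read_nfsd_timers is hypothetical and not part of any RA.

```python
from pathlib import Path

# Assumed default mount point of the nfsd pseudo-filesystem on Linux.
NFSD_DIR = Path("/proc/fs/nfsd")


def read_nfsd_timers(base=NFSD_DIR):
    """Return the NFSv4 lease and grace times (in seconds) as a dict.

    A value is None when the corresponding file is missing or unreadable.
    """
    timers = {}
    for key, name in (("lease", "nfsv4leasetime"), ("grace", "nfsv4gracetime")):
        try:
            timers[key] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            timers[key] = None
    return timers


if __name__ == "__main__":
    # Run on each head node to compare the effective settings.
    print(read_nfsd_timers())
```

Running it on both head nodes is a quick way to confirm that the reduced settings really apply on each of them.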

From my investigation, the problem actually happens at the TCP level. Let's describe the most basic scenario, i.e. a single file system moving from node1 to node2 and back.

I first start the NFS servers using a clone resource. Node1 then starts a group that mounts a file system, adds it to the export list (exportfs RA) and adds a floating IP.

I then mount this folder from a Linux NFS client.

When I "migrate" my group out of node1, everything correctly moves to node2. IPaddr2 stops first, then the exportfs "stop" action takes about 12 seconds (10 seconds of lease time plus the rest), and my file system gets unmounted. During that time, I see the NFS client trying to talk to the floating IP (on its node1 MAC address). Once everything has moved to node2, the client sends TCP packets to the new MAC address and node2 replies with a TCP RST. At this point, the client starts a NEW TCP session and it works fine.

However, on node1, I can still see an ESTABLISHED TCP session between the client and the floating IP on port 2049 (NFS), even though the IP is gone. After a short time, the session moves to FIN_WAIT1 and stays there for a while.
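
The observation above can be reproduced by watching the socket table on the old node. The following is a minimal Python sketch that parses the text of /proc/net/tcp and lists sockets whose local port is 2049, together with their TCP state. It is only an illustration: the function names are hypothetical, it covers IPv4 only (sockets listed in /proc/net/tcp6 are not included), and the address decoding assumes a little-endian host such as x86.

```python
import socket
import struct

# TCP state codes as printed (in hex) in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

NFS_PORT = 2049


def decode_addr(field):
    """Decode a 'HEXIP:HEXPORT' field (IPv4, little-endian host assumed)."""
    ip_hex, port_hex = field.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def nfs_sockets(proc_net_tcp_text, port=NFS_PORT):
    """Return (local, remote, state) tuples for sockets bound to `port`."""
    found = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 4:
            continue
        local_ip, local_port = decode_addr(fields[1])
        if local_port != port:
            continue
        remote_ip, remote_port = decode_addr(fields[2])
        # Fall back to the raw hex code for states not in the table.
        state = TCP_STATES.get(fields[3].upper(), fields[3])
        found.append((f"{local_ip}:{local_port}",
                      f"{remote_ip}:{remote_port}", state))
    return found
```

Passing it the contents of /proc/net/tcp on node1 right after the group has moved away should show whether the old client connection is still ESTABLISHED or already in FIN_WAIT1.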

When I then "unmigrate" my group back to node1, I see the same behavior, except that node1 is *not* sending TCP RSTs because it still has a TCP session with the client. I imagine that the sequence numbers do not match, so node1 simply doesn't reply at all. It then takes several minutes for the client to give up and start a new NFS session.

Does anyone have an idea about how to handle this problem? I have done this with iSCSI, where we can explicitly "kill" sessions, but I don't think NFS has anything similar. I also don't see anything in the IPaddr2 RA that would help in killing TCP sessions while removing a floating IP.

My next ideas would be either to tune the TCP stack in order to shorten the time spent in the FIN_WAIT1 state, or to synchronize sessions between the nodes (using conntrackd). That just seems like overkill.
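
Before any TCP tuning, it helps to know the current settings. The following is a minimal Python sketch that reads a few retransmission-related sysctls from /proc/sys/net/ipv4. Which of them, if any, bounds the time a socket spends in FIN_WAIT1 in this scenario is an assumption that would need to be verified; tcp_fin_timeout in particular is documented for FIN_WAIT2. The helper name and the candidate list are hypothetical.

```python
from pathlib import Path

# Standard location of the IPv4 sysctls on Linux.
SYSCTL_DIR = Path("/proc/sys/net/ipv4")

# Candidate knobs only; their relevance to FIN_WAIT1 here is unverified.
CANDIDATES = ("tcp_orphan_retries", "tcp_retries2", "tcp_fin_timeout")


def read_tcp_sysctls(base=SYSCTL_DIR, names=CANDIDATES):
    """Return {sysctl name: int value}, or None for entries that cannot be read."""
    values = {}
    for name in names:
        try:
            values[name] = int((Path(base) / name).read_text().split()[0])
        except (OSError, ValueError, IndexError):
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_tcp_sysctls().items():
        print(f"net.ipv4.{name} = {value}")
```

The sketch only reads the values; changing them affects every TCP connection on the node, not just NFS.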

Thanks for any input! Patrick
