[ClusterLabs] Failing over NFSv4/TCP exports

Andreas Kurz andreas.kurz at gmail.com
Wed Aug 17 17:16:07 EDT 2016


Hi,

On Wed, Aug 17, 2016 at 3:44 PM, Patrick Zwahlen <paz at navixia.com> wrote:

> Dear list (sorry for the rather long e-mail),
>
> I'm looking for someone who has successfully implemented the "exportfs" RA
> with NFSv4 over TCP (and is willing to share some information).
>
> The final goal is to present NFS datastores to ESXi over 2 "head" nodes.
> Both nodes must be active in the sense that they both have an NFS server
> running but they export different file systems (via exportfs entries and
> floating IPaddr2 addresses).
>
> When moving an export to another node, we move the entire
> "filesystem/export/ipaddr" stack but we keep the NFS server running (as it
> might potentially be exporting some other file systems via other IPs).
>
> Both nodes are sharing disks (JBOD for physical and shared VMDKs for
> testing). Disks are only accessed by a single "head" node at any given time
> so a clustered file system is not required.
>
> To my knowledge, this setup is best described by Florian Haas here:
> https://www.suse.com/documentation/sle_ha/singlehtml/book_sleha_techguides/book_sleha_techguides.html
> (except that we're not using DRBD and LVM)
>
> Before going into more detail, I should mention that I have already read all
> of the posts and examples below, as well as many of the NFS-related questions
> on this list from the past year or so.
>
> http://wiki.linux-nfs.org/wiki/index.php/Nfsd4_server_recovery
> http://wiki.linux-nfs.org/wiki/index.php/NFS_Recovery_and_Client_Migration
> http://oss.clusterlabs.org/pipermail/pacemaker/2011-July/011000.html
> https://access.redhat.com/solutions/42868
>
> I'm forced to use TCP because of ESXi, and I'm willing to use NFSv4 because
> ESXi can do "session trunking" or some sort of "multipath" with version 4
> (not tested yet).
>
> The problem I see is what a lot of people have already mentioned: failover
> works nicely, but failback takes a very long time. Many posts mention
> putting /var/lib/nfs on a shared disk, but that only makes sense when we
> fail over an entire NFS server (as opposed to just exports). Moreover, I don't
> see any relevant information written to /var/lib/nfs when a single Linux
> NFSv4 client is mounting a folder.
>
> NFSv4 LEASE and GRACE time have been reduced to 10 seconds. I'm using the
> exportfs RA parameter "wait_for_leasetime_on_stop=true".
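>
> For completeness, this is roughly what that part of the configuration looks
> like here (paths, network and resource names below are just placeholders,
> and the nfsd values have to be written before the NFS server starts -
> assuming a kernel that exposes /proc/fs/nfsd/nfsv4gracetime):
>
>   # on both nodes, before nfsd starts
>   echo 10 > /proc/fs/nfsd/nfsv4leasetime
>   echo 10 > /proc/fs/nfsd/nfsv4gracetime
>
>   # exportfs primitive, crm shell syntax
>   primitive p_export_data ocf:heartbeat:exportfs \
>     params directory=/srv/data clientspec=192.168.1.0/24 \
>       options=rw,no_root_squash fsid=1 \
>       wait_for_leasetime_on_stop=true \
>     op stop interval=0 timeout=120s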
>
> From my investigation, the problem actually happens at the TCP level.
> Let's describe the most basic scenario, i.e. a single file system moving from
> node1 to node2 and back.
>
> I first start the NFS servers using a clone resource. Node1 then starts a
> group that mounts a file system, adds it to the export list (exportfs RA)
> and adds a floating IP.
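>
> In crm shell terms the layout is something like this (again with placeholder
> names; the NFS server clone simply wraps the distribution's systemd unit
> here, ocf:heartbeat:nfsserver would work the same way):
>
>   primitive p_nfsserver systemd:nfs-server
>   clone cl_nfsserver p_nfsserver
>
>   primitive p_fs_data ocf:heartbeat:Filesystem \
>     params device=/dev/sdb1 directory=/srv/data fstype=xfs
>   primitive p_ip_data ocf:heartbeat:IPaddr2 \
>     params ip=192.168.1.100 cidr_netmask=24
>   group g_data p_fs_data p_export_data p_ip_data
>
>   order o_nfs_before_exports inf: cl_nfsserver g_data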
>
> I then mount this folder from a Linux NFS client.
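>
> (On the client that is simply something like
>   mount -t nfs4 -o proto=tcp 192.168.1.100:/srv/data /mnt/test
> using the placeholder address and path from above.)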
>
> When I "migrate" my group out of node1, everything correctly moves to
> node2. IPAddr2:stop, then the exportfs "stop" action takes about 12 seconds
> (10 seconds LEASE time plus the rest) and my file system gets unmounted.
> During that time, I see the NFS client trying to talk to the floating IP
> (on its node1 MAC address). Once everything has moved to node2, the client
> sends TCP packets to the new MAC address and node2 replies with a TCP
> RESET. At this point, the client restarts a NEW TCP session and it works
> fine.
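>
> (I trigger the move with "crm resource migrate g_data node2" and watch the
> traffic on both nodes with something like
>   tcpdump -ni eth0 host 192.168.1.100 and tcp port 2049
> again with the placeholder names from above.)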
>
> However, on node 1, I can still see an ESTABLISHED TCP session between the
> client and the floating IP on port 2049 (NFS), even though the IP is gone.
> After a short time, the session moves to FIN_WAIT1 and stays there for a
> while.
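>
> (That is easy to watch with ss; on node1
>   ss -tno state fin-wait-1 '( sport = :2049 )'
> shows the lingering socket, and -o shows the retransmission timer counting
> down.)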
>
> When I then "unmigrate" my group to node1 I see the same behavior except
> that node1 is *not* sending TCP RESETS because it still has a TCP session
> with the client. I imagine that the sequence numbers do not match so node1
> simply doesn't reply at all. It then takes several minutes for the client
> to give up and restart a new NFS session.
>
> Does anyone have an idea how to handle this problem? I have done
> this with iSCSI, where we can explicitly "kill" sessions, but I don't think
> NFS has anything similar. I also don't see anything in the IPaddr2 RA that
> would help with killing TCP sessions while removing a floating IP.
>

This is a known problem ... have a look at the portblock RA - it has a
feature to send out TCP "tickle ACKs" to reset such hanging sessions. So you
can configure a portblock resource that blocks the TCP port before starting
the VIP, and another portblock resource that unblocks the port afterwards
and sends out those tickle ACKs.
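
Reusing the placeholder names from your description, the group would then look
roughly like this (untested sketch - check the portblock RA metadata for the
exact parameters, e.g. tickle_dir and sync_script):

  primitive p_block_data ocf:heartbeat:portblock \
    params ip=192.168.1.100 portno=2049 protocol=tcp action=block
  primitive p_unblock_data ocf:heartbeat:portblock \
    params ip=192.168.1.100 portno=2049 protocol=tcp action=unblock \
      tickle_dir=/srv/data/tickle reset_local_on_unblock_stop=true
  group g_data p_block_data p_fs_data p_export_data p_ip_data p_unblock_data

The "block" instance sits in front of the VIP so the port is filtered before
the address moves in; the "unblock" instance comes last and sends the tickle
ACKs. tickle_dir should live on the storage that moves with the group, so the
new node knows which client connections to tickle.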

Regards,
Andreas


>
> My next ideas would be to either tune the TCP stack in order to reduce the
> time spent in FIN_WAIT1, or to synchronize sessions between the nodes (using
> conntrackd). That just seems like overkill.
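>
> (If I went down the sysctl route, the knobs would presumably be
> net.ipv4.tcp_orphan_retries for the FIN_WAIT1 retransmissions and
> net.ipv4.tcp_fin_timeout for FIN_WAIT2, e.g.
>   sysctl -w net.ipv4.tcp_orphan_retries=2
>   sysctl -w net.ipv4.tcp_fin_timeout=10
> but that affects every TCP connection on the node.)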
>
> Thanks for any input! Patrick
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>