
I have two machines connected with 10 Gbit Ethernet. One of them acts as the NFS server and the other as the NFS client.

Testing the network speed over TCP with iperf shows ~9.8 Gbit/s throughput in both directions, so the network is OK.
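
For reference, the iperf run was of roughly this form (the exact flags here are only illustrative):

node01:~ # iperf -s
node02:~ # iperf -c 192.168.1.101 -t 30 -r   # -r also runs the test in the reverse direction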

Testing the NFS server's disk performance:

dd if=/dev/zero of=/mnt/test/rnd2 count=1000000

The result is ~150 MBytes/s, so the disk is fine for writing.
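
A variant with an explicit block size and a final fsync rules out page-cache effects (block size and count are arbitrary):

node01:~ # dd if=/dev/zero of=/mnt/test/rnd2 bs=1M count=5000 conv=fsync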

Server's /etc/exports is:

/mnt/test 192.168.1.0/24(rw,no_root_squash,insecure,sync,no_subtree_check)

The client mounts this share to its local /mnt/test with the following options:

node02:~ # mount | grep nfs
192.168.1.101:/mnt/test on /mnt/test type nfs4 (rw,relatime,sync,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.102,local_lock=none,addr=192.168.1.101)

If I try to download a large file (~5 GB) on the client machine from the NFS share, I get ~130-140 MBytes/s, which is close to the server's local disk performance, so it's satisfactory.

But when I try to upload a large file to the NFS share, the upload starts at ~1.5 MBytes/s, slowly increases up to 18-20 MBytes/s and stops increasing. Sometimes the share "hangs" for a couple of minutes before the upload actually starts, i.e. traffic between the hosts becomes close to zero, and if I execute ls /mnt/test, it does not return for a minute or two. Then the ls command returns and the upload starts at its initial 1.5 MBytes/s speed.

When the upload speed reaches its maximum (18-20 MBytes/s), I run iptraf-ng and it shows ~190 Mbit/s of traffic on the network interface, so neither the network nor the server's HDD is the bottleneck here.

What I tried:

1. Set up an NFS server on a third host which was connected only with a 100 Mbit Ethernet NIC. The results are analogous: download shows good performance and nearly full 100 Mbit network utilization, while upload does not perform faster than hundreds of kilobytes per second, leaving network utilization very low (2.5 Mbit/s according to iptraf-ng).

2. I tried to tune some NFS parameters:

  • sync or async

  • noatime

  • soft instead of hard

  • rsize and wsize are already at their maximum in my examples, so I tried decreasing them in several steps down to 8192

3. I tried to switch the client and server machines (set up the NFS server on the former client and vice versa). Moreover, there are six more servers with the same configuration, so I tried mounting them to each other in different combinations. Same result.

4. MTU=9000; MTU=9000 combined with 802.3ad link aggregation; and 802.3ad link aggregation with MTU=1500.

5. sysctl tuning:

node01:~ # cat /etc/sysctl.conf 
net.core.wmem_max=16777216
net.core.rmem_max=16777216
net.ipv4.tcp_rmem= 10240 873800 16777216
net.ipv4.tcp_wmem= 10240 873800 16777216
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
net.core.netdev_max_backlog = 5000

Same result.

6. Mount from localhost:

node01:~ # cat /etc/exports
/mnt/test *(rw,no_root_squash,insecure,sync,no_subtree_check)
node01:~ # mount -t nfs -o sync localhost:/mnt/test /mnt/testmount/

And here I get the same result: download from /mnt/testmount/ is fast, upload to /mnt/testmount/ is very slow (no faster than 22 MBytes/s), and there is a small delay before the transfer actually starts. Does this mean that the network stack works fine and the problem is in NFS?
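
For reference, the loopback comparison boils down to writing the same data twice, something like this (file names and sizes are arbitrary):

node01:~ # dd if=/dev/zero of=/mnt/test/direct bs=1M count=1024 conv=fsync        # directly to the local filesystem
node01:~ # dd if=/dev/zero of=/mnt/testmount/via_nfs bs=1M count=1024 conv=fsync  # through the loopback NFS mount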

None of this helped; the results did not differ significantly from the default configuration. echo 3 > /proc/sys/vm/drop_caches was executed before all tests.

The MTU of all NICs on all 3 hosts is back to 1500, and no non-standard network tuning is in place. The Ethernet switch is a Dell MXL 10/40GbE.

OS is CentOS 7.

node01:/mnt/test # uname -a
Linux node01 3.10.0-123.20.1.el7.x86_64 #1 SMP Thu Jan 29 18:05:33 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

What settings am I missing? How can I make NFS write quickly and without hangs?

Sergey
  • You have a pretty well-rounded test case, but I'd try mounting on the server itself and writing from there, that way you can figure out if the NFS stack or the networking stack is at fault. Also, try switching the server and the client (export from client, mount on server) and also using a different client altogether. stracing the server/client processes didn't reveal anything? – Dalibor Karlović Apr 11 '15 at 11:12
  • @DaliborKarlović I tried everything except strace and added the information to the question. Mounting from localhost is also slow, so the networking stack and the switch do not seem to be at fault. I use kernel-space NFS and get `Operation not permitted` when attempting to attach strace to the NFS process. – Sergey Apr 11 '15 at 16:50
  • I assume this means you can rule out the networking stack completely (but you'd need to attach strace to it to make sure). You should be able to strace any process as root user [if not hit by a certain bug](http://askubuntu.com/questions/143561/why-wont-strace-gdb-attach-to-a-process-even-though-im-root). – Dalibor Karlović Apr 11 '15 at 16:59
  • @DaliborKarlović Of course I run strace as root. I'm able to attach to any userspace process, but not to kernel-space ones. But what information can I get from its output? I suppose it will produce hundreds of thousands of lines of output if I attach it to NFS and start uploading. Should I pay attention to nonzero return values? – Sergey Apr 11 '15 at 17:14
  • You're right, I wasn't thinking about it being a non-userland process. I'd expect to see what it was doing while it "hangs" at the beginning of the transfer, it might be something trivial like a misconfigured reverse DNS lookup. – Dalibor Karlović Apr 11 '15 at 17:26
  • How are you actually downloading/uploading? `cp`, `rsync`, something else? – Marco Guerri Apr 12 '15 at 10:08
  • install tuned package and use throughput-performance profile (tuned-adm profile throughput-performance). Set rsize/wsize to 65536. Try nfs3 ( mount -o vers=3 ...) – kofemann Apr 24 '15 at 10:21
  • note that net.ipv4.tcp_sack=1 will only help if you have packet loss. The Red Hat tuning guide even suggests turning it off, despite it being the default. – SvennD Jul 28 '17 at 14:34
  • Wonder if `dd if=/dev/zero` is somehow misleading/allowing compression, maybe use `/dev/random` instead. Also as a note to followers I've "heard" that using nfs 4.x is slower than 3.x :) – rogerdpack Oct 08 '18 at 20:24

4 Answers


You use the sync option in your export statement. This means that the server only confirms write operations after they have actually been written to the disk. Given that you have a spinning disk (i.e. no SSD), this requires on average at least half a revolution of the disk per write operation, which is the cause of the slowdown.

With the async setting, the server acknowledges the write operation to the client as soon as it is processed, but before it is written to the disk. This is a little less reliable, e.g. in case of a power failure the client may have received an ack for an operation that never happened. However, it delivers a huge increase in write performance.

(edit) I just saw that you already tested the async vs. sync options. However, I am almost sure that this is the cause of your performance degradation issue; I once had exactly the same symptom with an identical setup. Maybe test it again. Did you give the async option in the export statement on the server AND in the mount operation on the client at the same time?
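
For illustration, an async variant of the export and mount from the question would look like this (only sync is swapped for async):

node01:~ # cat /etc/exports
/mnt/test 192.168.1.0/24(rw,no_root_squash,insecure,async,no_subtree_check)
node01:~ # exportfs -ra
node02:~ # mount -t nfs -o async 192.168.1.101:/mnt/test /mnt/test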

Bernd Gloss

It can be a problem related to packet size and latency. Try the following:

  • enable jumbo frames (MTU 9000) on both hosts and on the switch;

  • increase the TCP read/write buffer limits via the net.core.* and net.ipv4.tcp_* sysctls on both the server and the client.

Then report back your results.

shodanshok
  • I tried jumbo frames with MTU = 9000, but the results were the same. I also tried link aggregation with 802.3ad, again no changes. So I reverted all these settings to get as close to the default state as possible. I also tried to tune the `net.core.*` and `net.ipv4.*` sysctls, but maybe I made too few experiments. OK, I'll run some more tests and report back. – Sergey Apr 11 '15 at 11:04
  • I tried once more to tune the sysctls on both the server and the client, but that didn't help. – Sergey Apr 11 '15 at 12:10
  • Have you tried with UDP as the transport protocol? – shodanshok Apr 11 '15 at 13:57
  • I have tried UDP (proto=udp in the mount options), but it works even 1-2 MBytes/s slower than TCP. The result was the same when mounting from localhost and from the remote host. – Sergey Apr 11 '15 at 16:00

http://veerapen.blogspot.com/2011/09/tuning-redhat-enterprise-linux-rhel-54.html

On systems with hardware RAID, changing the Linux I/O scheduler from the default [cfq] to [noop] gives I/O improvements.
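
For example (sda is just a placeholder for the RAID device):

echo noop > /sys/block/sda/queue/scheduler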

Use the nfsstat command to calculate the percentage of reads vs. writes, and set the RAID controller's cache ratio to match.
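
For reference, the per-operation counts and percentages on the server are shown by:

nfsstat -s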

For heavy workloads you will need to increase the number of NFS server threads.
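
On CentOS/RHEL the thread count is typically set via RPCNFSDCOUNT in /etc/sysconfig/nfs (16 here is only an example value), followed by a restart of the NFS server:

grep RPCNFSDCOUNT /etc/sysconfig/nfs
RPCNFSDCOUNT=16
systemctl restart nfs-server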

Configure the nfs threads to write without delay to the disk using the no_delay option.

Tell the Linux kernel to flush as quickly as possible so that writes are kept as small as possible. In the Linux kernel, dirty page writeback frequency is controlled by two parameters.
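
The two parameters meant here are presumably the vm.dirty_* sysctls; the values below are only illustrative:

sysctl -w vm.dirty_background_ratio=5      # start background writeback earlier
sysctl -w vm.dirty_expire_centisecs=500    # consider dirty pages old (and flushable) sooner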

For faster disk writes, use the filesystem's data=journal option and prevent updates to file access times, which in themselves result in additional data written to the disk. This mode is the fastest when data needs to be read from and written to disk at the same time, where it outperforms all other modes.
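
For example, an fstab entry of this form (the device and filesystem type are placeholders):

/dev/sdb1  /mnt/test  ext4  noatime,data=journal  0 0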

Vasco V.

It is a bit tricky to find out why the transfer is so slow. This topic is pretty old, but it is still relatively difficult to find the right solution quickly, even these days.

BTW: Trying to switch from TCP to UDP only slowed things down in my case.

Using this one-liner solved the problem right away (change the 192.168.1.0/24 network and /mount_target to fit your needs):

sudo exportfs -o rw,async,no_subtree_check 192.168.1.0/24:/mount_target

If you want these changes to persist, add the same options to the /etc/exports file like this:

/mount_target 192.168.1.0/24(rw,async,no_subtree_check)

Please remember that /24 grants access to your whole local network (192.168.1.0 in this example).

It is a good practice to check and verify what these export options will mean to your system.
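
The currently active exports and their effective options can be listed with:

sudo exportfs -v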

With the settings mentioned above I was able to saturate my network connection (1 Gb and 10 Gb tested) to at least 95%. You can surely optimize further to squeeze out some more throughput, but it is absolutely enough for my usage. Hope it helps.

Note: tested on NFSv3 with transfers from Windows and Linux.

Bart