
I have an AWS EFS drive mounted on an FTP server, and performance occasionally becomes really poor. The public directory holds about 10,000 small text files, and under moderately heavy traffic (~30-40 concurrent users) the FTP LIST operation hangs. Even running ls in a terminal in the same directory is slow during these periods. strace on both the relevant proftp process and the ls command shows them stuck in the getdents system call.
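To put numbers on the slow periods, a small timing script along these lines could be left running against the directory. This is only a sketch: `time_listing` and `sample_listing` are names made up for illustration, and it relies on Python's `os.scandir`, which reads directory entries through getdents64 on Linux, so it should exercise the same code path that ls and LIST are stuck in.

```python
import os
import tempfile
import time


def time_listing(path):
    """Return (entry_count, elapsed_seconds) for one full directory scan."""
    start = time.perf_counter()
    # os.scandir pulls entries from the kernel in batches (getdents64 on Linux).
    with os.scandir(path) as entries:
        count = sum(1 for _ in entries)
    return count, time.perf_counter() - start


def sample_listing(path, samples=5, interval=1.0):
    """Time several scans in a row so slow periods stand out from normal ones."""
    results = []
    for i in range(samples):
        results.append(time_listing(path))
        if i < samples - 1:
            time.sleep(interval)
    return results
```

Calling `sample_listing("/path/to/efs/public")` (a placeholder path) from both FTP servers at the same time would show whether the slowdown on one host lines up with write or move activity on the other.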

The standby FTP server has the same drive mounted. When I run a move command in this directory on the standby to archive some of these 10k files, the main FTP server hangs at the LIST operation for as long as the move is running.

All metrics on both the server and the EFS drive show that resources are well within limits. When the LIST operation hangs on the main FTP server, the CPU is 99% idle and CPU iowait (wa) is negligible. All proftp child processes go into D state (uninterruptible sleep) while they wait for getdents to return the listing of the 10,000-entry directory.
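The D-state observation can be captured without watching top by hand. The sketch below reads `/proc/<pid>/stat`, whose third field is the process state, and reports every process currently in uninterruptible sleep. The function names are hypothetical, and it assumes a Linux host with `/proc` mounted.

```python
import os


def parse_stat(line):
    """Split a /proc/<pid>/stat line into (pid, command, state)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so locate the first "(" and the last ")".
    lpar = line.index("(")
    rpar = line.rindex(")")
    pid = int(line[:lpar].strip())
    comm = line[lpar + 1:rpar]
    state = line[rpar + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """Return (pid, command) for every process currently in D state."""
    found = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, name, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except OSError:
            continue  # the process exited between listdir and open
        if state == "D":
            found.append((pid, comm))
    return found
```

Logging the output of `uninterruptible_processes()` once a second during a hang would give a timeline of how many proftp children are blocked and for how long.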

EFS metrics show no issues with burst credits, throughput, total I/O etc.

Although this is not a good design, it is something I can't change as it is a client's system. Load testing showed it could handle 60 concurrent users performing the LIST operation. I am inclined to think some users are writing to the directory and that this is preventing the LIST operations from completing. I am aware of some OS bugs that make EFS/NFS operations serial rather than parallel, but this shouldn't be an issue here: the servers run a patched version of Amazon Linux that is fully up to date, so these bugs don't apply. The EFS is mounted according to AWS defaults, e.g. NFS hard mount etc.
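Since the mount options matter here, it may help to record exactly what each server ended up with rather than what was requested. A rough sketch, assuming the standard `/proc/mounts` layout (device, mount point, filesystem type, comma-separated options); the function names are made up for illustration:

```python
def nfs_mount_options(mounts_text):
    """Map each NFS mount point to a dict of its mount options."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expect: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        opts = {}
        for item in fields[3].split(","):
            # Flags such as "hard" carry no value and map to an empty string.
            key, _, value = item.partition("=")
            opts[key] = value
        result[fields[1]] = opts
    return result


def read_nfs_mounts(path="/proc/mounts"):
    """Read the live mount table and return its NFS entries."""
    with open(path) as fh:
        return nfs_mount_options(fh.read())
```

Comparing the output of `read_nfs_mounts()` on the main and standby servers would confirm that both are using the same NFS version and options.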

What is puzzling is why performance sometimes goes from very reasonable to crawling without any noticeable difference in traffic patterns. I wouldn't have thought that write/move operations on a separate server would hang LIST operations on the other server, even though they target the same directory. This also happens when I am not moving files, but I suspect that is because some users are writing to the directory.

Any thoughts would be appreciated if you have seen NFS issues like this before.

AdoEs
    I don't know what filesystem EFS is using internally, but many filesystems do get a bit slow with a large number of files in a single directory. I'd expect trouble closer to 100,000 than 10,000 but it's worth considering whether you can rearrange your directory structure. – Michael Hampton Aug 23 '19 at 20:13
  • I agree completely. I have recommended the change to the client before but that will be done long term. – AdoEs Aug 23 '19 at 20:30
