
I was asked this in an interview.

I said let's just use cp. Then I was asked to mimic the implementation of cp itself.

So I thought okay, let's open the file, read it one byte at a time, and write it to another file.

Then I was asked to optimize it further. I thought let's read in chunks and write those chunks. I didn't have a good answer for what a good chunk size would be. Please help me out with that.
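
Roughly what I had in mind was something like this (a minimal sketch; the `copy_file` name and the 1 MiB chunk size are just placeholders):

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK_SIZE (1 << 20)   /* 1 MiB per read/write; tune for the system */

int copy_file(const char *src, const char *dst)
{
    int in = open(src, O_RDONLY);
    if (in < 0) return -1;

    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) { close(in); return -1; }

    char *buf = malloc(CHUNK_SIZE);
    if (!buf) { close(in); close(out); return -1; }

    ssize_t n;
    while ((n = read(in, buf, CHUNK_SIZE)) > 0) {
        /* write() may write less than requested, so loop until the chunk is out */
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(out, buf + off, (size_t)(n - off));
            if (w < 0) { n = -1; break; }
            off += w;
        }
        if (n < 0) break;
    }

    int ok = (n == 0);              /* 0 means we hit EOF without errors */
    free(buf);
    if (close(out) != 0) ok = 0;    /* a failed close can still lose data */
    close(in);
    return ok ? 0 : -1;
}
```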

Then I was asked to optimize even further. I thought maybe we could read from different threads in parallel and write in parallel.

But I quickly realized that reading in parallel is OK, but writing won't work in parallel (without locking, I mean) since data from one thread might overwrite another's.

So I thought okay, let's read in parallel, put the chunks in a queue, and then have a single thread take them off the queue and write them to the file one by one.

Does that even improve performance? (I mean not for small files, where it would just be overhead, but for large files.)

Also, is there an OS trick where I could just point two files to the same data on disk? I know there are symlinks, but apart from that?

user3732361
    *But I quickly realized reading in parallel is OK but writing will not work in parallel(without locking I mean) since data from one thread might overwrite others.* What is that based on? There are numerous ways to write to a file from multiple threads in ways that require no locking. You can use `pwrite()` or `open()` the file multiple times. The real problem with parallel writes to most files is the extra seeks required for the physical disk heads. If the filesystem is a high-end HPC filesystem, though, files can be spread over multiple disks and parallel writes can be much faster – Andrew Henle Mar 29 '18 at 09:32
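
For illustration, here is a minimal sketch of what the comment means by offset-based writes with `pwrite()` needing no locking (the file name, offsets, and data are made up):

```c
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("out.bin", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return 1;

    const char a[] = "chunk written at offset 0";
    const char b[] = "chunk written at offset 4096";

    /* Each call carries its own file offset, so two threads could issue these
     * concurrently on the same fd without locking, as long as the target
     * ranges don't overlap. */
    pwrite(fd, a, sizeof a - 1, 0);
    pwrite(fd, b, sizeof b - 1, 4096);

    close(fd);
    return 0;
}
```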

2 Answers


"The fastest way to copy a file" is going to depend on the system - all the way from the storage media to the CPUs. The most likely bottleneck will be the storage media - but it doesn't have to be. Imagine high-end storage that can move data faster than your system can create physical page mappings to read the data into...

In general, the fastest way to move a lot of data is to make as few copies of it as possible, and to avoid any extra operations, especially S-L-O-W ones such as physical disk head seeks.

So for a local copy on a common single-rotating-disk workstation/desktop/laptop system, the biggest thing to do is minimize physical disk seeks. That means read and write single-threaded, in large chunks (1 MB, for example) so the system can do whatever optimization it can, such as read-ahead or write coalescing.

That will likely get you to 95% or even better of the system's maximum copy performance. Even standard C buffered fopen()/fread()/fwrite() probably gets at least 80-90% of the best possible performance.

You can get the last few percentage points in a few ways. First, match your IO block size to a multiple of the filesystem's block size so that you're always reading full blocks from the filesystem. Second, use direct IO to bypass copying your data through the page cache. Going disk->userspace or userspace->disk is faster than going disk->page cache->userspace and userspace->page cache->disk, but for a single-spinning-disk copy that's not going to matter much, if it's even measurable.
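
As a rough sketch of the direct-IO idea on Linux (the `copy_direct` name is made up, `O_DIRECT` requires aligned buffers and transfer sizes, and the 4096-byte alignment here is an assumption about the device's block size):

```c
#define _GNU_SOURCE            /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

enum { ALIGN = 4096, CHUNK = 1 << 20 };   /* assumed block size and chunk size */

int copy_direct(const char *src, const char *dst)
{
    int in  = open(src, O_RDONLY | O_DIRECT);
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (in < 0 || out < 0) {
        if (in >= 0) close(in);
        if (out >= 0) close(out);
        return -1;
    }

    /* O_DIRECT wants the buffer (and transfer sizes) aligned to the block size */
    void *buf;
    if (posix_memalign(&buf, ALIGN, CHUNK) != 0) { close(in); close(out); return -1; }

    ssize_t n;
    while ((n = read(in, buf, CHUNK)) > 0) {
        /* A final partial chunk whose size is not block-aligned will need
         * special handling (e.g. a fallback write without O_DIRECT); omitted. */
        if (write(out, buf, (size_t)n) != n) { n = -1; break; }
    }

    free(buf);
    close(in);
    close(out);
    return n == 0 ? 0 : -1;
}
```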

You can use various dd options to test copying a file like this. Try iflag=direct/oflag=direct, or conv=notrunc.

You can also try using sendfile() to avoid copying data into userspace entirely. Depending on the implementation, that might be faster than using direct IO.
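
For example, a rough sketch of a sendfile()-based copy on Linux (the `copy_sendfile` name is made up; recent kernels allow a regular file as the destination, older ones required a socket):

```c
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

int copy_sendfile(const char *src, const char *dst)
{
    int in = open(src, O_RDONLY);
    if (in < 0) return -1;

    struct stat st;
    if (fstat(in, &st) != 0) { close(in); return -1; }

    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) { close(in); return -1; }

    /* The kernel moves the data; it never passes through a userspace buffer.
     * A single call may transfer less than asked, so loop until done. */
    off_t offset = 0;
    while (offset < st.st_size) {
        ssize_t sent = sendfile(out, in, &offset, st.st_size - offset);
        if (sent <= 0) break;
    }

    close(in);
    close(out);
    return offset == st.st_size ? 0 : -1;
}
```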

Pre-allocating the destination file may or may not improve copy performance - that will depend on the filesystem. If the filesystem doesn't support sparse files, though, preallocating the file to a specific length might very well be very, very slow.
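
One way to preallocate on POSIX systems is posix_fallocate(); whether that is cheap or painfully slow depends on exactly this filesystem support (the `preallocate_like` helper is just an illustration):

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Preallocate the destination to the source's size before copying.
 * Where the filesystem can allocate blocks without writing them, this is
 * nearly free; otherwise the library may fall back to writing zeros,
 * which is exactly the slow case mentioned above. */
int preallocate_like(int src_fd, int dst_fd)
{
    struct stat st;
    if (fstat(src_fd, &st) != 0)
        return -1;
    return posix_fallocate(dst_fd, 0, st.st_size);
}
```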

There just isn't all that much you can do to dramatically improve performance of a copy from and to the same single spinning physical disk - those disk heads will dance, and that will take time.

SSDs are much easier - to get maximal IO rates, just use parallel IO via multiple threads. But again, the "normal" IO will probably be at 80-90% of maximal.
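
A rough sketch of that multi-threaded approach, with each thread copying its own disjoint slice of the file via pread()/pwrite() so no locking is needed (the thread count, chunk size, and `copy_parallel` name are arbitrary assumptions; error handling is kept minimal):

```c
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Compile with -pthread. Thread count and chunk size are arbitrary. */
enum { NTHREADS = 4, CHUNK = 1 << 20 };

struct slice { int in, out; off_t start, end; };

static void *copy_slice(void *arg)
{
    struct slice *s = arg;
    char *buf = malloc(CHUNK);
    if (!buf) return NULL;

    for (off_t off = s->start; off < s->end; ) {
        size_t want = (s->end - off) < CHUNK ? (size_t)(s->end - off) : CHUNK;
        ssize_t n = pread(s->in, buf, want, off);      /* read at explicit offset  */
        if (n <= 0) break;
        if (pwrite(s->out, buf, n, off) != n) break;   /* write at explicit offset */
        off += n;
    }
    free(buf);
    return NULL;
}

int copy_parallel(const char *src, const char *dst)
{
    int in = open(src, O_RDONLY);
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    struct stat st;
    if (in < 0 || out < 0 || fstat(in, &st) != 0) return -1;

    pthread_t tid[NTHREADS];
    struct slice sl[NTHREADS];
    off_t per = (st.st_size + NTHREADS - 1) / NTHREADS;   /* size of each slice */

    int started = 0;
    for (int i = 0; i < NTHREADS; i++) {
        off_t start = i * per;
        off_t end = start + per < st.st_size ? start + per : st.st_size;
        sl[i] = (struct slice){ in, out, start, end };
        if (pthread_create(&tid[i], NULL, copy_slice, &sl[i]) != 0) break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    close(in);
    close(out);
    return started == NTHREADS ? 0 : -1;
}
```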

Things get a lot more interesting and complex optimizing IO performance for other types of storage systems such as large RAID arrays and/or complex filesystems that can stripe single files across multiple underlying storage devices. Maximizing IO on such systems involves matching the software's IO patterns to the characteristics of the storage, and that can be quite complex.

Finally, one important part of maximizing IO rates is not doing things that dramatically slow things down. It's really easy to drag a physical disk down to a few KB/sec IO rates - read/write small chunks from/to random locations all over the disk. If your write process drops 16-byte chunks to random locations, the disk will spend almost all its time seeking and it won't move much data at all while doing that.

In fact, not "killing yourself" with bad IO patterns is a lot more important than spending a lot of effort attempting to get four or five percentage points faster in optimal cases.

Because if IO is a bottleneck on a simple system, just go buy a faster disk.

Andrew Henle
  • Hi Andrew, thanks for the detailed explanation. I just have one doubt about _That means read and write single-threaded, in large chunks (1 MB, for example) so the system can do whatever optimization it can, such as read-ahead or write coalescing._ How do we decide what a good chunk size is? Even 1 GB would work by that logic, right? – user3732361 Apr 15 '18 at 06:01
  • Thanks for all the help – user3732361 Apr 16 '18 at 07:24

But I quickly realized that reading in parallel is OK, but writing won't work in parallel (without locking, I mean) since data from one thread might overwrite another's.

Multithreading is not normally going to speed up a process like this. Any performance benefit you may gain could be wiped out by the synchronization overhead.

So I thought okay, let's read in parallel, put the chunks in a queue, and then have a single thread take them off the queue and write them to the file one by one.

That's only going to give an advantage on a system that supports asynchronous I/O.

To get the maximum speed you'd want to write in buffer sizes that are multiples of the disk's cluster factor (assuming a hard file system). This could be sped up on systems that permit queuing asynchronous I/O (as Windows does, for example).

You'd also want to create the output file with its initial size the same as the input file's. That way your write operations never have to extend the file.

Probably the fastest file copy possible would be to memory map the input and output files and do a memory copy. This is especially efficient on systems that treat mapped files as page files.
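
For completeness, a minimal sketch of such an mmap-based copy on a POSIX system (the `copy_mmap` name is made up; error handling is trimmed):

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int copy_mmap(const char *src, const char *dst)
{
    int in = open(src, O_RDONLY);
    int out = open(dst, O_RDWR | O_CREAT | O_TRUNC, 0644);
    struct stat st;
    if (in < 0 || out < 0 || fstat(in, &st) != 0) return -1;
    if (st.st_size == 0) { close(in); close(out); return 0; }

    /* The destination must already have the full length before it can be mapped */
    if (ftruncate(out, st.st_size) != 0) return -1;

    void *s = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, in, 0);
    void *d = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, out, 0);
    if (s == MAP_FAILED || d == MAP_FAILED) return -1;

    memcpy(d, s, st.st_size);          /* the actual "memory copy" */

    munmap(s, st.st_size);
    munmap(d, st.st_size);
    close(in);
    close(out);
    return 0;
}
```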

user3344003
  • *Probably the fastest file copy possible would be to memory map the input and output files and do a memory copy. This is especially efficient on systems that treat mapped files as page files.* Actually, `mmap()` used in this manner is not very likely to be anywhere near the fastest way. Read this LKML post: https://marc.info/?l=linux-kernel&m=95496636207616&w=2 It ends with "But your test-suite (just copying the data once) is probably pessimal for mmap(). Linus" – Andrew Henle Mar 29 '18 at 09:28
  • That is why I had the qualification. Due to the nature of Unix file systems, I doubt that mmap on Linux actually maps memory directly to a file the way memory mapping does on OSs with hard file systems. – user3344003 Mar 29 '18 at 14:44