I have been having some issues with R becoming very sluggish when accessing files over our corporate network. So I dropped back and did some testing, and I was shocked to discover that the R file.copy() command is much slower than the equivalent file copy using system('cp ...'). Is this a known issue or am I doing something wrong here?

Here's my test:

I have three files:

  • large_random.txt - ~100 MB
  • medium_random.txt - ~10 MB
  • small_random.txt - ~1 MB

I created these on my Mac like so:

dd if=/dev/urandom of=small_random.txt bs=1048576 count=1
dd if=/dev/urandom of=medium_random.txt bs=1048576 count=10
dd if=/dev/urandom of=large_random.txt bs=1048576 count=100

But the following R tests were all done using Windows running in a virtual machine. The J: drive is local and the N: drive is 700 miles (1100 km) away.

library(tictoc)

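# Time the same copy twice: once with base R, once via system().
# Note that each timer also includes removing the previous destination file.
test_copy <- function(source, des){
  tic('r file.copy')
  file.remove(des)
  file.copy(source, des)
  toc()

  tic('system call')
  system(paste('rm', des, sep=' '))
  system(paste('cp', source, des, sep=' '))
  toc()
}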

source <- 'J:\\tidy_examples\\dummyfiles\\small_random.txt'
des <- 'N:\\JAL\\2018\\_temp\\small_random.txt'
test_copy(source, des)

source <- 'J:\\tidy_examples\\dummyfiles\\medium_random.txt'
des <- 'N:\\JAL\\2018\\_temp\\medium_random.txt'
test_copy(source, des)

source <- 'J:\\tidy_examples\\dummyfiles\\large_random.txt'
des <- 'N:\\JAL\\2018\\_temp\\large_random.txt'
test_copy(source, des)

Which results in the following:

> source <- 'J:\\tidy_examples\\dummyfiles\\small_random.txt'
> des <- 'N:\\JAL\\2018\\_temp\\small_random.txt'
> test_copy(source, des)
r file.copy: 6.49 sec elapsed
system call: 2.12 sec elapsed
>
> source <- 'J:\\tidy_examples\\dummyfiles\\medium_random.txt'
> des <- 'N:\\JAL\\2018\\_temp\\medium_random.txt'
> test_copy(source, des)
r file.copy: 56.86 sec elapsed
system call: 4.65 sec elapsed
>
> source <- 'J:\\tidy_examples\\dummyfiles\\large_random.txt'
> des <- 'N:\\JAL\\2018\\_temp\\large_random.txt'
> test_copy(source, des)
r file.copy: 562.94 sec elapsed
system call: 31.01 sec elapsed
>

So what's going on that makes the system call so much faster? For the large file, file.copy() is more than 18 times slower!
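To make the comparison concrete, here is a small sketch that recomputes the slowdown and the effective throughput from the timings reported above. The column names are made up for illustration, and the file sizes are the approximate ones stated earlier.

```r
# Timings copied from the test output above (seconds elapsed).
timings <- data.frame(
  size_mb       = c(1, 10, 100),          # approximate file sizes
  file_copy_sec = c(6.49, 56.86, 562.94), # 'r file.copy' timer
  system_sec    = c(2.12, 4.65, 31.01)    # 'system call' timer
)

# How many times slower file.copy() was than the system call.
timings$slowdown <- timings$file_copy_sec / timings$system_sec

# Effective throughput in MB/s for each method.
timings$file_copy_mb_s <- timings$size_mb / timings$file_copy_sec
timings$system_mb_s    <- timings$size_mb / timings$system_sec

print(round(timings, 2))
```

On these numbers, file.copy() stays below 0.2 MB/s at every file size, while the system call's throughput rises with file size. That pattern would be consistent with a fixed cost per small chunk, though this test alone does not establish the cause.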

Peter Mortensen
JD Long
  • Interesting. Is it safe to assume that fs caching is not a factor? (What if you reverse the order of `file.copy`/`system`?) Are both `J:` and `N:` network drives? Are the use of `file.remove` and `system("rm...")` similarly a problem (different, similar, ...)? What if you remove them from the tic/toc timing (or time them on their own)? – r2evans Apr 17 '18 at 22:03
  • Good questions. J is local & N is networked. The rm takes no time, it’s all in the copy. – JD Long Apr 17 '18 at 22:22
  • R implements its own file copy, so it could just be inefficient with network I/O. From these lines of code, it looks like it uses a pretty small buffer on Windows, which would cause slow transfers (https://github.com/wch/r-source/blob/48499cfb5a9098dc6879a1fb517be3df5c146ab5/src/main/platform.c#L375). Finding this was the extent of my C knowledge though, so I'm not sure how it's actually implemented in the end. – Shawn Apr 17 '18 at 22:32
  • Exacerbated by the fact that R's `system` and `system2` are a bit broken (e.g., whitespace). JD, I have not tried it, but I wonder if [`fs`](https://cran.r-project.org/web/packages/fs/index.html) would perform any better. – r2evans Apr 17 '18 at 22:34
  • `mingw` `mv` (via `coretools`) is really `cp`, which calls underlying `copy` code that eventually computes a buffer size. Even under Windows it uses (IIRC) 8K vs the 512b @Shawn noted. That's definitely going to have an impact (and not in a good way) for `file.copy()`. – hrbrmstr Apr 18 '18 at 03:24
  • Everything above was in the bog-standard R GUI on Windows. But I've noticed terrible performance when running RStudio on a project that's located on a network share. All the RStudio panes become laggy as hell. I wonder if this is related... – JD Long Apr 18 '18 at 13:32
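The buffer-size idea raised in the comments can be illustrated with a minimal sketch. This is not how R implements file.copy(); `copy_chunked` and `buffer_bytes` are hypothetical names, and R connections add their own buffering, so the count below reflects iterations of this loop rather than the underlying system calls. It only shows how the number of read/write round trips grows as the chunk shrinks.

```r
# Hypothetical helper: copy a file in fixed-size chunks and
# return how many chunks were written.
copy_chunked <- function(from, to, buffer_bytes) {
  in_con <- file(from, "rb")
  on.exit(close(in_con), add = TRUE)
  out_con <- file(to, "wb")
  on.exit(close(out_con), add = TRUE)

  n_writes <- 0L
  repeat {
    # Read at most buffer_bytes raw bytes; an empty result means end of file.
    chunk <- readBin(in_con, what = "raw", n = buffer_bytes)
    if (length(chunk) == 0L) break
    writeBin(chunk, out_con)
    n_writes <- n_writes + 1L
  }
  n_writes
}

# Usage on a temporary 1 MB file (placeholder data, not the files from the test).
src <- tempfile(fileext = ".bin")
writeBin(as.raw(sample(0:255, 2^20, replace = TRUE)), src)

small_dst <- tempfile(fileext = ".bin")
large_dst <- tempfile(fileext = ".bin")
writes_512b <- copy_chunked(src, small_dst, buffer_bytes = 512L)
writes_8k   <- copy_chunked(src, large_dst, buffer_bytes = 8192L)
```

If each chunk paid a network round trip to a share hundreds of miles away, the 512-byte variant would pay that cost 16 times as often as the 8K one for the same file.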

1 Answer

I ran into the same problem of poor file.copy() performance over network share drives. My solution was to use fs::file_copy() instead, which performed slightly better even than the direct system copy call.
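A minimal sketch of that swap, assuming the fs package is installed. The wrapper name `copy_file` and the temporary paths are hypothetical; the fallback to base file.copy() is only there so the snippet still runs without fs.

```r
# Hypothetical drop-in wrapper: prefer fs::file_copy(), fall back to base R.
copy_file <- function(from, to, overwrite = TRUE) {
  if (requireNamespace("fs", quietly = TRUE)) {
    fs::file_copy(from, to, overwrite = overwrite)
  } else {
    warning("fs is not installed; falling back to base file.copy()")
    ok <- file.copy(from, to, overwrite = overwrite)
    if (!ok) stop("file.copy() failed for ", from)
  }
  invisible(to)
}

# Usage with temporary placeholder files; substitute your own local and
# network paths to reproduce the comparison from the question.
src <- tempfile(fileext = ".bin")
dst <- tempfile(fileext = ".bin")
writeBin(as.raw(sample(0:255, 1e5, replace = TRUE)), src)

elapsed <- system.time(copy_file(src, dst))[["elapsed"]]
```

Timing with system.time() keeps the measurement to the copy itself, without the file removal that was included inside the timers in the original test.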

Peter Mortensen
N'toN
  • I confirm the `fs` solution solved the problem; however, it would be nice to understand what happens with standard `file.copy`. – Waldi Nov 07 '22 at 14:11