4

I'm moving a whole bunch of files from one place to another, some of them pretty large .wav files, and changing around the directory structure at the destination, so I couldn't copy directories wholesale. I originally was using the copyFile function recommended here: http://blogs.blumetech.com/blumetechs-tech-blog/2011/05/faster-python-file-copy.html

def copyFile(src, dst, buffer_size=10485760, perserveFileDate=True):
    '''
    Copies a file to a new location. Much faster performance than Apache Commons due to use of larger buffer
    @param src:    Source File
    @param dst:    Destination File (not file path)
    @param buffer_size:    Buffer size to use during copy
    @param perserveFileDate:    Preserve the original file date
    '''
    #    Check to make sure destination directory exists. If it doesn't create the directory
    dstParent, dstFileName = os.path.split(dst)
    if(not(os.path.exists(dstParent))):
        os.makedirs(dstParent)

    #    Optimize the buffer for small files
    buffer_size = min(buffer_size,os.path.getsize(src))
    if(buffer_size == 0):
        buffer_size = 1024

    if shutil._samefile(src, dst):
        raise shutil.Error("`%s` and `%s` are the same file" % (src, dst))
    for fn in [src, dst]:
        try:
            st = os.stat(fn)
        except OSError:
            # File most likely does not exist
            pass
        else:
            # XXX What about other special files? (sockets, devices...)
            if shutil.stat.S_ISFIFO(st.st_mode):
                raise shutil.SpecialFileError("`%s` is a named pipe" % fn)
    with open(src, 'rb') as fsrc:
        with open(dst, 'wb') as fdst:
            shutil.copyfileobj(fsrc, fdst, buffer_size)

    if(perserveFileDate):
        shutil.copystat(src, dst)

This was still taking forever, so just experimenting, I replaced it with this:

def copyFile(src, dst):
    os.system('cp -p %s %s' % (src, dst))

I wound up with about a 10x speedup! The first version was copying at about 22~25 KBps, and the second version is now at about 220 KBps. This is a just fine hack for myself, but I'd like to understand better why incase I need to develop & share code like this in the future.

JoFrhwld
  • 8,867
  • 4
  • 37
  • 32
  • Ignoring the tests done in that function since they will be negligible for large files, the obvious points to test are whether you provide the right buffer_size for your system (experiment with different values) and compare that function to the performance of `shutil.copy2`. – Midnighter May 11 '14 at 16:34
  • 1
    How exactly are you benchmarking it? I cannot repeat your results - with a ~700MB test file, copying from (old-generation) disk to disk takes ~9s with `shutil.copyfileobj`, and ~8.5s with `cp`. In other words, `cp` is somewhat faster, but to the order of 5% faster, not 10 times. Maybe you were benchmarking Python first and cp second, and didn't control for filesystem caching performed by the OS? – user4815162342 May 11 '14 at 21:23

0 Answers0