7

Question: Are there Windows API calls (perhaps NTFS-only) which allow one to split a very large file into many others without actually copying any data (in other words, specifying the logical breakpoints between joined files, with file names and sizes)?

Examples: SetFileValidData, NtSetInformationFile

Scenario: I need to programmatically distribute/copy 10 GB of files from a non-local drive (including network, USB and DVD drives). This is made up of over 100,000 individual files with a median size of about 16 KB, joined into ~2 GB chunks.

However, using simple FileStream APIs (64 KB buffer) to extract files from the chunks on non-local drives into individual files on a local hard drive seems to be limited on my machine to about 4 MB/s, whereas copying the entire chunks using Explorer runs at over 80 MB/s!

It seems logical to copy the entire chunks, but give Windows enough information to logically split them into the individual files (which in theory should be able to happen very, very fast).

Doesn't the Vista install do something like this?

tikinoa
  • I wouldn't use TFileStream, I'd suggest using THandleStream with a CreateFile call that uses FILE_FLAG_SEQUENTIAL_SCAN. Also try using a 256KB buffer, it may be faster (see the sketch just after these comments). – Jon Benedicto Oct 06 '09 at 23:23
  • Thanks Jon for the suggestion. Yes, I have tried this and have gotten slightly better performance (and even tried FILE_FLAG_NO_BUFFERING, along with the headaches that involves), yet performance is still an order of magnitude slower copying so many small files compared to copying them merged together. – tikinoa Oct 06 '09 at 23:36
  • How are you merging them before the copy? Why can't it unmerge them after the copy? – Bill Oct 07 '09 at 15:03
  • Here's a twist - use BitTorrent. – macbirdie Oct 21 '09 at 11:38
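
For reference, a minimal sketch of the CreateFile approach Jon suggests in the first comment (pywin32 is used here purely for illustration, to match the script further down this page; the 256 KB buffer and FILE_FLAG_SEQUENTIAL_SCAN come from his comment and are not benchmarked values - a Delphi caller would wrap the same handle in a THandleStream):

from win32file import CreateFile, ReadFile, WriteFile
import win32con

def copy_sequential(src, dst, bufsize=256 * 1024):
    # Open the source with a hint that it will be read front to back.
    h_src = CreateFile(src, win32con.GENERIC_READ, win32con.FILE_SHARE_READ,
                       None, win32con.OPEN_EXISTING,
                       win32con.FILE_FLAG_SEQUENTIAL_SCAN, None)
    h_dst = CreateFile(dst, win32con.GENERIC_WRITE, 0, None,
                       win32con.CREATE_ALWAYS, 0, None)
    try:
        while True:
            hr, data = ReadFile(h_src, bufsize)
            if not data:
                break
            WriteFile(h_dst, data)
    finally:
        h_src.Close()
        h_dst.Close()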

6 Answers

3

Although there are Volume Shadow Copies, these are an all-or-nothing approach - you can't cut out just part of a file. They are also only temporary. Likewise, hard links share all content, without exception. Unfortunately, cutting out just parts of a file is not supported on Windows, although some experimental Linux filesystems such as btrfs support it.

bdonlan
3

You can't, in practice. The data has to physically move whenever a new boundary does not coincide with an existing cluster boundary.

For a high-speed copy, read the input file asynchronously, break it up into your 16 KB segments, post those to a queue (in memory), and set up a thread pool to empty the queue by writing out those 16 KB segments. Considering those sizes, the writes can probably be synchronous. Considering the speed of local versus remote I/O, and the fact that you have multiple writer threads, the chance of your queue overflowing should be quite low.
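
A rough sketch of that pipeline, kept in Python to match the script later in this thread (assumptions: the reader is a plain blocking thread rather than true overlapped/async I/O, the part_<offset> naming is invented for illustration, and mapping each 16 KB segment back to its real file name and extent is left out):

import os
import threading
import Queue  # the stdlib queue module ('queue' on Python 3)

SEGMENT_SIZE = 16 * 1024

def read_chunk(chunk_path, jobs):
    # Producer: read the big chunk sequentially and post (offset, data) items.
    f = open(chunk_path, 'rb')
    try:
        offset = 0
        while True:
            data = f.read(SEGMENT_SIZE)
            if not data:
                break
            jobs.put((offset, data))
            offset += len(data)
    finally:
        f.close()
    jobs.put(None)  # sentinel: tell the writers there is no more work

def write_segments(jobs, out_dir):
    # Consumer: drain the queue, writing each segment to its own small file.
    while True:
        item = jobs.get()
        if item is None:
            jobs.put(None)  # re-post the sentinel so the other writers stop too
            return
        offset, data = item
        out = open(os.path.join(out_dir, 'part_%012d' % offset), 'wb')
        try:
            out.write(data)
        finally:
            out.close()

def split_chunk(chunk_path, out_dir, writer_count=4):
    jobs = Queue.Queue(maxsize=256)  # bounded, so the reader cannot run away
    writers = [threading.Thread(target=write_segments, args=(jobs, out_dir))
               for _ in range(writer_count)]
    for t in writers:
        t.start()
    read_chunk(chunk_path, jobs)
    for t in writers:
        t.join()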

MSalters
0

A thought on this: is there enough space to copy the large chunk to a local drive and then work on it as a memory-mapped file? I remember a discussion somewhere, some time ago, that these files are very much faster because they use the Windows file/page cache and are easy to set up.

From Wikipedia and from StackOverflow
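
If the chunk is already on the local drive, here is a minimal illustration of the memory-mapped idea using Python's standard mmap module (the offset and length of each embedded file are assumed to come from the chunk's own index, which is not shown):

import mmap

def extract_piece(chunk_path, offset, length, dest_path):
    with open(chunk_path, 'rb') as f:
        # Map the whole chunk; reads are then served from the Windows page cache.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            with open(dest_path, 'wb') as out:
                out.write(mm[offset:offset + length])
        finally:
            mm.close()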

Despatcher
0

Perhaps this technique would work for you: Copy the large chunks (using the already established efficient method), then use something like the following script to split the large chunks into smaller chunks locally.

from __future__ import division
import os
import sys
from win32file import CreateFile, SetEndOfFile, GetFileSize, SetFilePointer, ReadFile, WriteFile
import win32con
from itertools import tee, izip, imap

def xfrange(start, stop=None, step=None):
    """
    Like xrange(), but yields floats instead.

    All numbers are generated on demand (this is a generator).
    """

    if stop is None:
        stop = float(start)
        start = 0.0

    if step is None:
        step = 1.0

    cur = float(start)

    while cur < stop:
        yield cur
        cur += step


# from Python 2.6 docs
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

def get_one_hundred_pieces(size):
    """
    Return start and stop extents for a file of given size
    that will break the file into 100 pieces of approximately
    the same length.

    >>> res = list(get_one_hundred_pieces(205))
    >>> len(res)
    100
    >>> res[:3]
    [(0, 2), (2, 4), (4, 6)]
    >>> res[-3:]
    [(199, 201), (201, 203), (203, 205)]
    """
    step = size / 100
    cap = lambda pos: min(pos, size)
    approx_partitions = xfrange(0, size+step, step)
    int_partitions = imap(lambda n: int(round(n)), approx_partitions)
    partitions = imap(cap, int_partitions)
    return pairwise(partitions)

def save_file_bytes(handle, length, filename):
    # Read the next `length` bytes from the already-positioned source handle
    # and write them out as the new file `filename`.
    hr, data = ReadFile(handle, length)
    assert len(data) == length, "%s != %s" % (len(data), length)
    h_dest = CreateFile(
        filename,
        win32con.GENERIC_WRITE,
        0,
        None,
        win32con.CREATE_NEW,
        0,
        None,
        )
    code, wbytes = WriteFile(h_dest, data)
    assert code == 0
    assert wbytes == len(data), '%s != %s' % (wbytes, len(data))

def handle_command_line():
    filename = sys.argv[1]
    h = CreateFile(
        filename,
        win32con.GENERIC_WRITE | win32con.GENERIC_READ,
        0,
        None,
        win32con.OPEN_EXISTING,
        0,
        None,
        )
    size = GetFileSize(h)
    extents = get_one_hundred_pieces(size)
    # Work from the end of the file backwards so that, after each piece has
    # been copied out, the source file can immediately be truncated there.
    for start, end in reversed(tuple(extents)):
        length = end - start
        SetFilePointer(h, start, win32con.FILE_BEGIN)
        target_filename = '%s-%d' % (filename, start)
        save_file_bytes(h, length, target_filename)
        # Truncate the original file at the start of the piece just copied.
        SetFilePointer(h, start, win32con.FILE_BEGIN)
        SetEndOfFile(h)

if __name__ == '__main__':
    handle_command_line()

This is a Python 2.6 script that uses pywin32 to call the Windows APIs. The same technique could be implemented in Delphi or C++ easily enough.

The main routine is in handle_command_line. It takes a filename and splits that file into chunks based on the get_one_hundred_pieces function. Your application would substitute a function that determines the extents appropriate to your data.

It then copies each chunk into its own file and calls SetEndOfFile to shrink the larger file (since that content now lives in its own file).

I have tested this against a 1 GB file broken into 100 pieces and it ran in less than 30 seconds. Furthermore, it should in theory run in a space-efficient manner (never consuming more than the total file size plus the largest chunk size at any given time). I suspect there are performance improvements to be made, but this is mostly a proof of concept.

Jason R. Coombs
0

You can copy the second chunk of the file into a new file and then truncate the original file. With this approach you copy only half of the file.
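
A minimal sketch of that idea, reusing the pywin32 calls from the script above (the 50/50 split point and the single whole-tail read are simplifications for illustration; a tail of 1 GB would really be read and written in smaller pieces):

from win32file import (CreateFile, ReadFile, WriteFile, SetFilePointer,
                       SetEndOfFile, GetFileSize)
import win32con

def split_in_half(path):
    h = CreateFile(path, win32con.GENERIC_READ | win32con.GENERIC_WRITE,
                   0, None, win32con.OPEN_EXISTING, 0, None)
    size = GetFileSize(h)
    half = size // 2
    # Copy the second half out to a new file...
    SetFilePointer(h, half, win32con.FILE_BEGIN)
    hr, tail = ReadFile(h, size - half)
    dest = CreateFile(path + '.part2', win32con.GENERIC_WRITE, 0, None,
                      win32con.CREATE_NEW, 0, None)
    WriteFile(dest, tail)
    # ...then truncate the original at the midpoint.
    SetFilePointer(h, half, win32con.FILE_BEGIN)
    SetEndOfFile(h)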

denisenkom
-1

Is there a reason you can't invoke the OS's copy routines to do the copying? That should do the same thing that Explorer does. It negates the need for your weird splitting thing, which I don't think exists.
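
For completeness, the OS copy routine being referred to, via the pywin32 wrapper used elsewhere on this page (a trivial sketch):

import win32file

def os_copy(source_path, dest_path):
    # Third argument True means fail if the destination already exists.
    win32file.CopyFile(source_path, dest_path, True)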

rmeador
  • Direct OS CopyFile routines are slightly faster than my own routines in copying the 100,000 files, yet the performance is still horrible (an order of magnitude slower) compared to copying the files merged together. Hence the desire to copy them merged, then split after the copy. – tikinoa Oct 06 '09 at 23:30