12

I am using this code to extract a chunk from a file:

// info is FileInfo object pointing to file
var percentSplit = info.Length * 50 / 100; // extract 50% of file
var bytes = new byte[percentSplit];
var fileStream = File.OpenRead(fileName);
fileStream.Read(bytes, 0, bytes.Length);
fileStream.Dispose();
File.WriteAllBytes(splitName, bytes);

Is there any way to speed up this process?

Currently, for a 530 MB file, it takes around 4-5 seconds. Can this time be improved?

Haris Hasan
  • Extracting 50% of the file into one buffer isn't efficient: [why 4kb to 8kb](http://stackoverflow.com/a/5911016/495455). If you have .NET 4 or greater you can use [Memory Mapped Files](http://msdn.microsoft.com/en-us/library/system.io.memorymappedfiles.memorymappedfile.aspx) (a sketch follows these comments). – Jeremy Thompson Feb 10 '13 at 04:35
  • What is the performance of your disk system? 100 MB/s does sound pretty reasonable. – Alexei Levenkov Feb 10 '13 at 04:42
  • Can I ask what you are splitting the file for? Is splitting the file your end result, or is this an intermediate step to get around another issue? – Scott Chamberlain Feb 10 '13 at 05:20
  • Arrays larger than 85 KB will end up on the seldom-collected, never-compacted large object heap. So, if this is something called very often from a long-running process, you could wind up with memory problems reading 200+ MB into an array. – devgeezer Feb 10 '13 at 12:27
  • I'm interested in your question and have started a bounty for it. I'd like to get an answer better than mine. – Ken Kin Feb 19 '13 at 15:42
  • Although I don't have experience with this, the new ReFS file system might help you out. From what I've read, it's implemented as allocate-on-write, so if you just copy the file to 2 files and change the size of the first file (using SetLength), it should save you half the time. – atlaste Feb 19 '13 at 19:51
  • FYI: using P/Invoke you can use memory-mapped files in pre-.NET 4 environments as well. – Just another metaprogrammer Feb 23 '13 at 17:40
  • @devgeezer: if allocating objects larger than 85 KB were a problem, arrays would have been capped at 85 KB. It is not a problem. In fact, he should allocate an array much larger than 85 KB and reuse it as much as possible. – Herman Schoenfeld Feb 25 '13 at 02:20
  • 5 seconds for writing 530/2 MB is adequate performance for a regular disk subsystem. The program's algorithm does not seem to be the bottleneck. – Sergey P. aka azure Feb 25 '13 at 13:38
  • @HermanSchoenfeld for more detail regarding the 85k threshold and GC behaviors of the LOH, read the following msdn article. http://msdn.microsoft.com/en-us/magazine/cc534993.aspx – devgeezer Feb 28 '13 at 00:07
  • @devgeezer I read that article before I wrote here. The only real problem with LOH allocations is the possibility of memory fragmentation. It doesn't mean such arrays shouldn't be allocated. The case here warrants it; he would only need to allocate a single array (much) larger than 85 KB (perhaps 1 MB) and simply reuse it. It would be much slower (software- and hardware-wise) to use a smaller array. – Herman Schoenfeld Feb 28 '13 at 06:51
  • To be clear, I *never* said not to allocate a large array; I offered a warning that there are consequences of frequent LOH allocation. There are several memory concerns for large-object use detailed in that article: LOH objects are collected in Gen 2, the least frequently GC'd generation; the LOH is never compacted (so fragmentation can lead to memory bloat); and the CLR zeroes memory before returning from allocation (which can degrade performance). – devgeezer Feb 28 '13 at 17:05
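As Jeremy Thompson's comment suggests, .NET 4 adds System.IO.MemoryMappedFiles. Below is a minimal sketch of copying the first half of a file through a mapped view; the paths are placeholders, and whether this actually beats a plain buffered copy depends on your disk and file system, so measure it before adopting it.

using System.IO;
using System.IO.MemoryMappedFiles;

class MmfSplitSketch
{
    static void Main()
    {
        const string fileName  = @"C:\temp\source.bin"; // placeholder paths
        const string splitName = @"C:\temp\split.bin";

        long half = new FileInfo(fileName).Length * 50 / 100;

        // Map the source file and expose only its first half as a read-only stream.
        using (var mmf = MemoryMappedFile.CreateFromFile(fileName, FileMode.Open))
        using (var view = mmf.CreateViewStream(0, half, MemoryMappedFileAccess.Read))
        using (var output = File.Create(splitName))
        {
            view.CopyTo(output); // Stream.CopyTo is available from .NET 4
        }
    }
}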

3 Answers

8

There are several cases to your question, and none of them is language-specific.

Here are a few things to consider:

  • What is the file system of the source/destination files?
  • Do you want to keep the original source file?
  • Do they lie on the same drive?

In C#, you will hardly find a method faster than File.Copy, which internally invokes the Win32 CopyFile function. Because the split is fifty percent, however, the following code might not be faster: it copies the whole file and then truncates the destination to the desired length.

var info = new FileInfo(fileName);
var percentSplit = info.Length * 50 / 100; // extract 50% of file

File.Copy(info.FullName, splitName);
using (var outStream = File.OpenWrite(splitName))
    outStream.SetLength(percentSplit);

Further, if

  1. you don't need to keep the original source file after it is split,
  2. the destination drive is the same as the source, and
  3. you are not using an encryption- or compression-enabled file system,

then the best thing you can do is not copy the file contents at all. For example, if your source file lies on a FAT or FAT32 file system, what you can do is:

  1. create a new directory entry (or entries) for the newly split part(s) of the file,
  2. let the entry (or entries) point to the starting cluster(s) of the target part(s),
  3. set the correct file size for each entry, and
  4. check for cross-linked clusters and avoid them.

If your file system is NTFS, you might need to spend a long time studying the spec.

Good luck!

Ken Kin
  • +1: Ken, I have deleted my answer as I found a fairly serious bug which meant my approach did not perform reliably, and once fixed was actually much slower than yours. I will be really interested to see if anything can actually beat the performance of `File.Copy`. – nick_w Feb 24 '13 at 07:53
  • This is actually a good benchmark for any suggested solution, which should run about twice as fast. Assuming File.Copy() runs at a given system's max, copying only half of it should take about half that time. – Hazzit Feb 25 '13 at 19:49
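Following up on that benchmarking suggestion: a small timing harness for the copy-then-truncate approach from this answer could look like the sketch below. The paths are placeholders, and the OS file cache can make repeat runs look unrealistically fast, so time a cold run if you can.

using System;
using System.Diagnostics;
using System.IO;

class SplitTiming
{
    static void Main()
    {
        const string fileName  = @"C:\temp\source.bin"; // placeholder paths
        const string splitName = @"C:\temp\split.bin";

        long percentSplit = new FileInfo(fileName).Length * 50 / 100;

        var sw = Stopwatch.StartNew();

        File.Copy(fileName, splitName, true);            // copy the whole file
        using (var outStream = File.OpenWrite(splitName))
            outStream.SetLength(percentSplit);           // then truncate to 50%

        sw.Stop();
        Console.WriteLine("Copy + SetLength: {0} ms", sw.ElapsedMilliseconds);
    }
}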
2
var percentSplit = (int)(info.Length * 50 / 100); // extract 50% of file
var buffer = new byte[8192];
using (Stream input = File.OpenRead(info.FullName))
using (Stream output = File.Create(splitName)) // Create truncates an existing destination; OpenWrite would leave stale bytes past the split point
{
    int bytesRead = 1;
    while (percentSplit > 0 && bytesRead > 0)
    {
        // never request more than the bytes still needed
        bytesRead = input.Read(buffer, 0, Math.Min(percentSplit, buffer.Length));
        output.Write(buffer, 0, bytesRead);
        percentSplit -= bytesRead;
    }
    output.Flush();
}

The flush may not be needed, but it doesn't hurt. Interestingly, changing the loop to a do-while rather than a while had a big hit on performance; I suppose the generated IL is not as fast. My PC was running the original code in 4-6 seconds, and the attached code seemed to run in about 1 second.

Jamie Gould
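The 8 KB copy buffer in the answer above is one knob; the FileStream's own internal buffer size and the SequentialScan hint are two more worth measuring. Below is a sketch of the same loop with those parameters exposed. The paths and the 64 KB figure are illustrative starting points, not measured optima.

using System;
using System.IO;

class BufferedSplitSketch
{
    static void Main()
    {
        var info = new FileInfo(@"C:\temp\source.bin");  // placeholder paths
        const string splitName = @"C:\temp\split.bin";

        long remaining = info.Length * 50 / 100;
        var buffer = new byte[8192];                     // copy buffer, as in the answer above
        const int streamBufferSize = 64 * 1024;          // FileStream's internal buffer (try other sizes)

        using (var input = new FileStream(info.FullName, FileMode.Open, FileAccess.Read,
                                          FileShare.Read, streamBufferSize, FileOptions.SequentialScan))
        using (var output = new FileStream(splitName, FileMode.Create, FileAccess.Write,
                                           FileShare.None, streamBufferSize, FileOptions.None))
        {
            int bytesRead;
            while (remaining > 0 &&
                   (bytesRead = input.Read(buffer, 0, (int)Math.Min(remaining, buffer.Length))) > 0)
            {
                output.Write(buffer, 0, bytesRead);
                remaining -= bytesRead;
            }
        }
    }
}

FileOptions.SequentialScan is only a hint to the OS cache manager, so on some systems it makes no measurable difference; the point of the sketch is to make the knobs easy to vary while timing.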
0

I get better results when reading/writing in chunks of a few megabytes. Performance also changes depending on the size of the chunk.

FileInfo info = new FileInfo(@"C:\source.bin");

long count = 0;
long split = info.Length * 50 / 100; // 50% of the file
long chunk = 8000000;                // ~8 MB per read

using (FileStream f = File.OpenRead(info.FullName))
using (BinaryReader br = new BinaryReader(f))
using (FileStream t = File.Create(@"C:\split.bin")) // Create truncates an existing output file
using (BinaryWriter bw = new BinaryWriter(t))
{
    DateTime start = DateTime.Now;

    while (count < split)
    {
        if (count + chunk > split)
        {
            chunk = split - count;   // shrink the final chunk
        }

        bw.Write(br.ReadBytes((int)chunk));
        count += chunk;
    }

    Console.WriteLine(DateTime.Now - start);
}
Marc
  • You shouldn't allocate chunks bigger than 85 KB; see devgeezer's remark under the question. – Simon Mourier Feb 19 '13 at 06:46
  • Allocating chunks larger than 85k is fine. In fact, the larger the better, so long as you reuse that chunk as much as possible. The only problem is fragmentation of the Large Object Heap which can result in an out of memory exception. Reusing the large buffer will prevent that, and when the buffer is no longer used (and memory is needed), it will be collected. No problem. – Herman Schoenfeld Feb 25 '13 at 02:17
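To make the buffer reuse described in the comment above concrete, here is a rewrite of this answer's loop that reads into a single preallocated array instead of letting ReadBytes allocate a fresh one on every iteration. The 1 MB size and the paths are illustrative only; the one large array does land on the LOH, but it is allocated exactly once.

using System;
using System.IO;

class ReusedBufferSplit
{
    static void Main()
    {
        FileInfo info = new FileInfo(@"C:\source.bin");
        long remaining = info.Length * 50 / 100;

        byte[] buffer = new byte[1024 * 1024]; // one reusable 1 MB buffer, allocated once

        using (FileStream source = File.OpenRead(info.FullName))
        using (FileStream target = File.Create(@"C:\split.bin"))
        {
            while (remaining > 0)
            {
                int read = source.Read(buffer, 0, (int)Math.Min(remaining, buffer.Length));
                if (read == 0) break;          // source ended early
                target.Write(buffer, 0, read);
                remaining -= read;
            }
        }
    }
}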