
We have a requirement for a Winforms app to read thousands of files from a local filesystem (or a network location) and store them in a database.

What would be the most efficient way to load the files? There could potentially be many gigabytes of data in total.

File.ReadAllBytes is currently used but the application eventually locks up as the computer's memory is used up.

The current code loops through a table containing file paths, which are used to read the binary data:

protected CustomFile ConvertFile(string path)
{
    try
    {
        // Reads the entire file into a single byte array in memory
        byte[] file = File.ReadAllBytes(path);
        return new CustomFile { FileValue = file };
    }
    catch
    {
        // On any error, skip this file by returning null
        return null;
    }
}

The data is then saved to the database (either SQL Server 2008 R2 or 2012) using NHibernate as the ORM.
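
For illustration, the surrounding loop looks roughly like the sketch below; the ImportFiles name, the paths parameter, and the session handling are simplified placeholders rather than the actual production code:

// Rough sketch of the surrounding loop; the method name, parameters and
// session handling are placeholders, not the actual production code.
protected void ImportFiles(IEnumerable<string> paths, ISessionFactory sessionFactory)
{
    foreach (string path in paths)
    {
        // Reads the whole file into memory via File.ReadAllBytes (see above).
        CustomFile customFile = ConvertFile(path);
        if (customFile == null)
            continue; // the file could not be read; skip it

        // Save each file through a short-lived NHibernate session/transaction.
        using (ISession session = sessionFactory.OpenSession())
        using (ITransaction transaction = session.BeginTransaction())
        {
            session.Save(customFile);
            transaction.Commit();
        }
    }
}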

tshepang
user1838662
    What kind of database? – 500 - Internal Server Error Mar 29 '13 at 22:28
  • How big is the biggest file? – Jirka Hanika Mar 29 '13 at 22:29
  • Can you show us your code? Perhaps a little tweaking of your code may be what's in order. – Mark Kram Mar 29 '13 at 22:31
  • The most memory efficient way would probably be to loop: read a file into memory, write the file to the database, free the memory, read the next file, etc. If the total amount of data to be read exceeds available memory, then what would you expect to happen? – Jim Mischel Mar 29 '13 at 22:35
  • The database will be either SQL Server 2008 R2 or 2012. Files are uploaded via a website so I doubt they will be massive, possibly up to 20 MB. The total data may amount to many gigabytes. – user1838662 Mar 29 '13 at 22:36
  • If they are uploaded via a website, why are you using File.ReadAllBytes? Are you saving them to disk first? – Davin Tryon Mar 29 '13 at 22:39
  • Files were originally uploaded via the website and are now being processed by a WinForms app. Sorry, I mentioned that to illustrate that they aren't going to be massive. I've updated the original question. – user1838662 Mar 29 '13 at 22:43
  • @JimMischel That is not the most memory efficient way to do it at all. If you buffer your reads you can handle files way larger than actual memory... – Francisco Soto Mar 29 '13 at 22:56

2 Answers


First, let me state that my knowledge is pre-.NET 4.0, so this information may be outdated; I know they were planning improvements in this area.

Do not use File.ReadAllBytes to read large files (larger than 85 KB), especially when you are reading many files sequentially. I repeat, do not.

Use something like a stream and BinaryReader.Read (or FileStream.Read) instead, so that your reads are buffered. Even if this sounds less efficient because you aren't blasting everything through a single buffer, doing it with ReadAllBytes simply won't work, as you discovered.

The reason is that ReadAllBytes reads the whole file into a single byte array. If that array is larger than 85 KB (there are other considerations, such as the number of array elements), it is allocated on the Large Object Heap (LOH), which is fine, BUT the LOH neither moves memory around nor defragments the released space, so, simplifying, this can happen:

  • Read a 1GB file: you now have a 1GB chunk in the LOH; save the file. (No GC cycle yet.)
  • Read a 1.5GB file: you request a 1.5GB chunk of memory, and it goes at the end of the LOH. Say a GC cycle then runs, so the 1GB chunk you used previously gets cleared, but the LOH now spans 2.5GB of address space, with the first 1GB free.
  • Read a 1.6GB file: the 1GB free block at the beginning is too small, so the allocator goes to the end again. Now the LOH spans 4.1GB of address space.
  • Repeat.

You are running out of memory, but you surely aren't actually using it all; fragmentation is probably killing you. You can also hit a genuine OOM situation if a single file is very large (I think the user address space of a 32-bit Windows process is 2GB?).

If the files aren't ordered or dependent on each other, maybe a few threads reading them with buffered BinaryReader calls would get the job done.
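
To make that concrete, here is a minimal sketch of a buffered read with FileStream.Read; the 64 KB buffer size and the processChunk callback are assumptions for illustration, not something taken from the question:

// Minimal sketch of buffered reading; the 64 KB buffer and the processChunk
// callback are assumptions for illustration.
public static void ReadInChunks(string path, Action<byte[], int> processChunk)
{
    const int BufferSize = 64 * 1024; // stays below the ~85 KB LOH threshold

    using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
    {
        var buffer = new byte[BufferSize];
        int bytesRead;
        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Hand each chunk to the caller, e.g. to append it to the database row.
            processChunk(buffer, bytesRead);
        }
    }
}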

References:

http://www.red-gate.com/products/dotnet-development/ants-memory-profiler/learning-memory-management/memory-management-fundamentals

https://www.simple-talk.com/dotnet/.net-framework/the-dangers-of-the-large-object-heap/

Francisco Soto
  • You're assuming that the database allows writing a single BLOB in multiple blocks. I suppose it's possible. I've never seen it done, but I'm not a DB guy. No need for `BinaryReader`. `FileStream.Read` works just fine, and saves you having to build a `BinaryReader`. – Jim Mischel Mar 30 '13 at 00:18
  • If needed he can also read into *many* buffers as opposed to one. This would not protect against large files consuming all memory, but it certainly would protect against fragmentation (if the buffers stay out of the LOH; see the sketch after these comments). – Francisco Soto Mar 30 '13 at 00:40
  • OK, so using File.ReadAllBytes is a bad idea, as it will populate the LOH, where memory cannot easily be reclaimed. How can I read the contents of a file into a byte array without running into this problem? – user1838662 Apr 01 '13 at 10:51
  • Use FileStream.Read(), which reads as many bytes as you tell it into a byte array. This way you control the byte array size. – Francisco Soto Apr 01 '13 at 16:29
  • @FranciscoSoto I have updated the original question with code based on FileStream, does it look ok? – user1838662 Apr 01 '13 at 20:19
  • It certainly does. You probably need to add a bit of error handling there, but it looks way better than using ReadAllBytes. – Francisco Soto Apr 02 '13 at 06:43
  • @FranciscoSoto I am still seeing rapid memory growth with that code, and the memory isn't being released. – user1838662 Apr 02 '13 at 07:52
  • MemoryStream holds everything you write to it in memory, which accounts for the big memory growth. I am not sure about MemoryStream's internals, i.e. whether it uses a single buffer or not; if it does, then you will get pretty much the same behavior as if you did it yourself. – Francisco Soto Apr 02 '13 at 19:11
  • I thought it best to ask a different question related to the first http://stackoverflow.com/questions/15781521/avoiding-the-loh-when-reading-a-binary – user1838662 Apr 03 '13 at 07:41
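
To illustrate the "many buffers" idea from the comments above, here is a sketch that keeps a whole file in memory as a list of small arrays so that no single allocation reaches the LOH; the 64 KB chunk size is an assumption for illustration:

// Sketch of the "many buffers" idea: hold a file as a list of small arrays so
// no single allocation reaches the ~85 KB LOH threshold. The 64 KB chunk size
// is an assumption.
public static List<byte[]> ReadIntoSmallBuffers(string path)
{
    const int ChunkSize = 64 * 1024;
    var chunks = new List<byte[]>();

    using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
    {
        var buffer = new byte[ChunkSize];
        int bytesRead;
        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Copy only the bytes actually read into a right-sized array.
            var chunk = new byte[bytesRead];
            Array.Copy(buffer, chunk, bytesRead);
            chunks.Add(chunk);
        }
    }

    return chunks;
}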

If you have many files, you should read them one by one.

If you have big files, and the database allows it, you should read them block by block into a buffer and write them block by block to the database. If you use File.ReadAllBytes, you might get an OutOfMemoryException when a file is too big to fit in the runtime's memory. The upper limit is less than 2 GiB, and it becomes even lower once memory gets fragmented after the application has been running for a while.
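
As a sketch of what "block by block to the database" could look like with SQL Server, the snippet below drops down to plain ADO.NET and appends chunks to a varbinary(max) column with the T-SQL .WRITE clause; the table and column names, the chunk size, and the assumption that the row already exists with a non-NULL (0x) value are all illustrative, not taken from the question:

// Sketch only: append a file to a varbinary(max) column in blocks using plain
// ADO.NET and the T-SQL .WRITE clause (bypassing NHibernate for the blob itself).
// dbo.Files, Content and Id are assumed names; the target row must already exist
// with Content initialised to 0x (not NULL) for .WRITE to work.
public static void AppendFileInBlocks(string connectionString, int fileId, string path)
{
    const int ChunkSize = 64 * 1024;
    var buffer = new byte[ChunkSize];

    using (var connection = new SqlConnection(connectionString))
    using (var stream = File.OpenRead(path))
    {
        connection.Open();

        int bytesRead;
        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Copy only the bytes actually read so the last block isn't padded.
            var chunk = new byte[bytesRead];
            Array.Copy(buffer, chunk, bytesRead);

            using (var command = new SqlCommand(
                // With a NULL offset, .WRITE appends the chunk to the existing value.
                "UPDATE dbo.Files SET Content.WRITE(@chunk, NULL, 0) WHERE Id = @id",
                connection))
            {
                command.Parameters.Add("@chunk", SqlDbType.VarBinary, chunk.Length).Value = chunk;
                command.Parameters.Add("@id", SqlDbType.Int).Value = fileId;
                command.ExecuteNonQuery();
            }
        }
    }
}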

Daniel A.A. Pelsmaeker