6

I'm running Delphi RAD Studio XE2.

I have some very large files, each containing a large number of lines. The lines themselves are small - just 3 tab separated doubles. I want to load a file into a TStringList using TStringList.LoadFromFile but this raises an exception with large files.

For files of 2 million lines (approximately 1GB) I get the EIntOverflow exception. For larger files (20 million lines and approximately 10GB, for example) I get the ERangeCheck exception.

I have 32GB of RAM to play with and am just trying to load this file and use it quickly. What's going on here, and what other options do I have? Could I use a file stream with a large buffer to load this file into a TStringList? If so, could you please provide an example?

Trojanian
  • I just have to wonder, why are you loading 20 million lines of text? You may have better luck using a `TFileStream`. – Jerry Dodge Nov 19 '14 at 02:40
  • Do you have an example showing how to use `TFileStream` to read lines of a text file into a `TStringList`? – Trojanian Nov 19 '14 at 02:51
  • I would prefer to store the file's lines in a database table. Manipulation would then be much faster than using T*List descendants. So the question is: what do you intend to do with the data? – iPath ツ Nov 19 '14 at 05:40
  • 1
    Simply put, the real solution is to stop trying load the entire file into memory. – David Heffernan Nov 19 '14 at 07:06

1 Answer

20

When Delphi switched to Unicode in Delphi 2009, the TStrings.LoadFromStream() method (which TStrings.LoadFromFile() calls internally) became very inefficient for large streams/files.

Internally, LoadFromStream() reads the entire file into memory as a TBytes, then converts that to a UnicodeString using TEncoding.GetString() (which decodes the bytes into a TCharArray, copies that into the final UnicodeString, and then frees the array), then parses the UnicodeString (while the TBytes is still in memory) adding substrings into the list as needed.

So, just prior to LoadFromStream() exiting, there are four copies of the file data in memory - three copies taking up at worst filesize * 3 bytes of memory (where each copy is using its own contiguous memory block + some MemoryMgr overhead), and one copy for the parsed substrings! Granted, the first three copies are freed when LoadFromStream() actually exits. But this explains why you are getting memory errors before reaching that point - LoadFromStream() is trying to use 3-4 GB of memory to load a 1GB file, and the RTL's memory manager cannot handle that.

If you want to load the content of a large file into a TStringList, you are better off using TStreamReader instead of LoadFromFile(). TStreamReader uses a buffered file I/O approach to read the file in small chunks. Simply call its ReadLine() method in a loop, Add()'ing each line to the TStringList. For example:

//MyStringList.LoadFromFile(filename);
Reader := TStreamReader.Create(filename, True); // True = detect BOM to determine the file's encoding
try
  MyStringList.BeginUpdate;
  try
    MyStringList.Clear;
    while not Reader.EndOfStream do
      MyStringList.Add(Reader.ReadLine);
  finally
    MyStringList.EndUpdate;
  end;
finally
  Reader.Free;
end;

Maybe some day, LoadFromStream() might be re-written to use TStreamReader internally like this.
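If performance matters, the loop above can be tuned further. This is a sketch only: the 64 KB buffer size and `AverageLineLength` are assumptions you would adjust for your own data, and `MyStringList`/`filename` are taken from the example above. It passes a larger buffer to `TStreamReader`'s overloaded constructor and pre-sets the list's `Capacity` to avoid repeated growth while adding:

```pascal
// Sketch: tune the buffer size and AverageLineLength for your data.
// This overload takes an encoding, a DetectBOM flag, and a buffer size.
Reader := TStreamReader.Create(filename, TEncoding.UTF8, True, 64 * 1024);
try
  MyStringList.BeginUpdate;
  try
    MyStringList.Clear;
    // Pre-size the list from the file size and a guessed average line length
    // to avoid repeated capacity growth during Add().
    MyStringList.Capacity := Reader.BaseStream.Size div AverageLineLength;
    while not Reader.EndOfStream do
      MyStringList.Add(Reader.ReadLine);
    // Reclaim any over-allocation once the real count is known.
    MyStringList.Capacity := MyStringList.Count;
  finally
    MyStringList.EndUpdate;
  end;
finally
  Reader.Free;
end;
```

Overestimating `Capacity` slightly is cheap; the trailing `Capacity := Count` releases the unused slots afterwards.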

Remy Lebeau
  • And if you know how many lines there are use `sl.Capacity := KnownValue;` to prevent multiple calls to ReallocMem() – Gerry Coll Nov 19 '14 at 03:05
  • 1
    `TStringList` does not call `ReallocMem()` on every `Add()`, it grows its memory in exponential capacities. – Remy Lebeau Nov 19 '14 at 03:13
  • Memory is reallocated only when the current `Count` is at `Capacity` when adding a new string. The `Capacity` grows (in items, the byte count would be `Capacity*SizeOf(TStringItem)` plus a little MemoryMgr overhead) as follows: `0,4,8,12,28,44,60,76,95,118,147,183,228,285,356,445,556,...` – Remy Lebeau Nov 19 '14 at 03:21
  • 1
    Even if you don't know exactly how many list items there are/will be, **huge** performance gains can be had by pre-setting Capacity to a representatively large number (a best guess, if you can) and then setting it to the actual count when the items have finished loading to reclaim any 'waste'. In this case, a good guesstimate at required capacity could be made given that the format of each line in the file is known (3 tab delim doubles): capacity := file size / average line length – Deltics Nov 19 '14 at 03:30
  • @RemyLebeau Thanks for this. I'm testing it now and it solves my problem (at least for 5GB files). How can I tweak it to improve the performance? Is your solution using a default buffer size? How do I alter the buffer size? Furthermore, in some cases (not all) I know the number of lines and the format of each line. – Trojanian Nov 19 '14 at 05:03
  • @RemyLebeau - never said it did grow each time. It grows by 25% when it reaches capacity. Older versions used ReallocMem, newer use SetLength, but use a delta of current capacity / 4 – Gerry Coll Nov 19 '14 at 07:36
  • @Trojanian: `TStreamReader` uses a 4KB buffer by default, but you can specify a different buffer size in the constructor. And there are plenty of third-party buffered I/O `TFileStream` implementations floating around. – Remy Lebeau Nov 19 '14 at 08:12
  • @RemyLebeau: The overloaded constructor I need is then `System.Classes.TStreamReader.Create(const Filename: string; Encoding: TEncoding; DetectBOM: Boolean = False; BufferSize: Integer = 1024)`. What is DetectBOM? – Trojanian Nov 19 '14 at 12:41
  • @GerryColl: How do I pre-set the capacity within the given answer example code? – Trojanian Nov 19 '14 at 12:47
  • 1
    @Trojanian: yes, that would be the constructor to use. `DetectBOM` tells the reader whether it can look at the beginning of the file to see if there is a [BOM](http://en.wikipedia.org/wiki/Byte_order_mark) specifying the encoding of the data. Otherwise, you have to specify an encoding in the `Encoding` parameter. Since you are loading a text file, and `TStreamReader` (and `TStringList`) operates on Unicode strings, the reader needs to know what the file encoding is so it can decode the text to Unicode while reading. – Remy Lebeau Nov 19 '14 at 18:29
  • 2
    @Trojanian: Deltics told you how to pre-set the capacity: `capacity := file size / average line length`. For example: `MyStringList.Capacity := Reader.BaseStream.Size div AverageLineLength;` You have to provide a value for `AverageLineLength` based on what your data actually looks like. – Remy Lebeau Nov 19 '14 at 18:31
  • @RemyLebeau: Thanks - very coherent. I learnt from this post. :-) – Trojanian Nov 20 '14 at 03:09
  • FWIW, stream reader is appallingly inefficient. Every time you consume something, the remainder of the buffer is moved down with `TStringBuilder.Remove`. This even ends up reallocating the buffer to reduce its capacity. Stream reader performance gets worse as the buffer size is increased. I cannot believe how appallingly bad the implementation is. – David Heffernan Mar 24 '15 at 12:08
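Following up on the last comment's point about stream-reader overhead: since the question says each line is just three tab-separated doubles, a hedged alternative is to skip `TStringList` entirely and parse straight into a dynamic array of records, which stores 24 bytes per line instead of a full Unicode string. The record/function names below are illustrative, not from the original post:

```pascal
uses
  System.Classes, System.SysUtils, System.StrUtils, System.Types;

type
  // Illustrative record: one per line of "A<TAB>B<TAB>C"
  TTriple = record
    A, B, C: Double;
  end;

function LoadTriples(const FileName: string): TArray<TTriple>;
var
  Reader: TStreamReader;
  Fields: TStringDynArray;
  Count: Integer;
begin
  Count := 0;
  SetLength(Result, 1024);
  Reader := TStreamReader.Create(FileName, TEncoding.ASCII, False, 64 * 1024);
  try
    while not Reader.EndOfStream do
    begin
      Fields := SplitString(Reader.ReadLine, #9);
      if Length(Fields) <> 3 then
        Continue; // skip blank or malformed lines
      if Count = Length(Result) then
        SetLength(Result, Length(Result) * 2); // exponential growth, like TStringList
      Result[Count].A := StrToFloat(Fields[0]);
      Result[Count].B := StrToFloat(Fields[1]);
      Result[Count].C := StrToFloat(Fields[2]);
      Inc(Count);
    end;
  finally
    Reader.Free;
  end;
  SetLength(Result, Count); // trim to actual size
end;
```

Note that `StrToFloat()` honors the locale's decimal separator; if the file always uses `.`, pass a `TFormatSettings` with `DecimalSeparator := '.'` to be safe.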