6

I have a few very large files, each 500MB+ in size, containing integer values (in fact it's a bit more complex). I'm reading those files in a loop and calculating the max value across all files. For some reason the memory grows constantly during processing; it looks like the GC never releases the memory held by the previous instances of lines.

I cannot stream the data and have to use GetFileLines for each file. Given that the actual amount of memory required to store the lines for one file is 500MB, why do I end up with 5GB of RAM used after 10 files have been processed? Eventually it crashes with an OutOfMemoryException after 15 files.

Calculation:

   int max = int.MinValue;

   for (int i = 0; i < 10; i++)
   {
      IEnumerable<string> lines = Db.GetFileLines(i);

      max = Math.Max(max, lines.Max(t=>int.Parse(t)));
   }

GetFileLines code:

   public static List<string> GetFileLines(int i)
   {
      string path = GetPath(i);

      //
      List<string> lines = new List<string>();
      string line;

      using (StreamReader reader = File.OpenText(path))
      {
         while ((line = reader.ReadLine()) != null)
         {
            lines.Add(line);
         }

         reader.Close();
         reader.Dispose(); // should I bother?
      }

      return lines;
   }
user1514042

6 Answers

6

For very large files, the ReadLines method is the best fit because it uses deferred execution: it does not load all lines into memory, and it is simple to use:

  max = Math.Max(max, File.ReadLines(path).Max(line => int.Parse(line)));

More information:

http://msdn.microsoft.com/en-us/library/dd383503.aspx

Edit:

This is roughly how ReadLines is implemented behind the scenes:

    public static IEnumerable<string> ReadLines(string fileName)
    {
        string line;
        using (var reader = File.OpenText(fileName))
        {
            while ((line = reader.ReadLine()) != null)
                yield return line;
        }
    }

Also, it is recommended to use parallel processing to improve performance when you have multiple files.
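
A minimal sketch of what that could look like with PLINQ, assuming the question's GetPath(i) helper and its 10 files (both carried over from the question, not part of this answer's code):

    // Sketch only: streams each file with File.ReadLines and lets PLINQ spread
    // the per-file work across cores. Requires System.IO and System.Linq.
    int overallMax = Enumerable.Range(0, 10)            // 10 files, as in the question
        .AsParallel()
        .Select(i => File.ReadLines(GetPath(i))         // GetPath(i) is the question's helper
                         .Max(line => int.Parse(line)))
        .Max();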

cuongle
4

You could be crashing because you are keeping references to the parsed results in memory after you are finished processing them (the code you show doesn't do this, but is it the same code you actually run?). It's highly unlikely that there's such a bug in StreamReader.

Are you sure you have to read the whole file into memory at once? It may well be possible to use a lazy sequence of lines as an IEnumerable<string> instead of loading a List<string> up front. Nothing prohibits this, in this code at least.

Finally, the Close and Dispose calls are redundant; using takes care of that automatically.
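
For example, GetFileLines could be rewritten as an iterator that yields a lazy sequence. This is only a sketch of the idea, based on the question's own code (note the explicit Close/Dispose are gone, since using already handles them):

    // Sketch: return lines lazily instead of buffering them in a List<string>.
    public static IEnumerable<string> GetFileLines(int i)
    {
        string path = GetPath(i);

        using (StreamReader reader = File.OpenText(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                yield return line; // only one line held in memory at a time
        }
    }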

Jon
  • Well I only use value types, can they still hold the reference? – user1514042 Oct 02 '12 at 11:30
  • Of course they can. If you can somehow access the list, someone is holding a reference to it. – Jon Oct 02 '12 at 11:33
  • True, but it gets replaced every time; your point would be right if I were unhappy about the final 500MB not being cleared, but I have a different problem. – user1514042 Oct 02 '12 at 11:43
  • 1
    @user1514042: If you are running out of memory, somewhere there are references that are not being cleared. It's that simple. – Jon Oct 02 '12 at 11:47
  • @user1514042, careful in your speech, friend. You are certainly not managing memory the way you think you are, or you wouldn't be running out of memory. Keep in mind that this line `IEnumerable lines = Db.GetFileLines(i);` creates a brand-new list every time but **only** replaces the previous reference, therefore the previous `List` still exists on the heap until it is collected. – Mike Perrenoud Oct 02 '12 at 11:49
  • @user1514042: Very much so. Perhaps you meant to address Mike instead? – Jon Oct 02 '12 at 12:12
1

Why not implement it as follows:

int max = Int32.MinValue;
string line;

using (var reader = File.OpenText(path))
{
    while ((line = reader.ReadLine()) != null)
    {
        int current;
        if (Int32.TryParse(line, out current))
            max = Math.Max(max, current);
    }
}
STO
0

You are reading the whole file into memory (the List<string> lines).

You could just read one line at a time and keep track of the highest number.

It will save you a lot of RAM.

Stig
  • Each line takes .5 sec to process; that's why it's much faster to read them all up front and then process them. We gain a lot by doing that, which is confirmed by performance tests. – user1514042 Oct 02 '12 at 11:26
0

It appears that you are always loading the entire file into memory. At the same time, you are also creating a managed object (a string stored in the List) for each line of the file.

Even so, there is no reason your memory usage should keep growing across files.

Please post the rest of the code as well; I suspect you are holding a reference to this list somewhere while it is in use, and hence it is not being collected.
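
Purely as a hypothetical illustration of the kind of leak being suggested (the history list below is invented for the example and is not in the question's code), something like this in the calling code would keep every file's list reachable and prevent the GC from reclaiming it:

    // Hypothetical: any long-lived reference like this keeps the old lists alive.
    var history = new List<IEnumerable<string>>();

    for (int i = 0; i < 10; i++)
    {
        IEnumerable<string> lines = Db.GetFileLines(i);
        history.Add(lines); // <-- previous files can never be collected now

        max = Math.Max(max, lines.Max(t => int.Parse(t)));
    }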

Murtuza Kabul
0

Alright, if you want a solution where you read the entire file in at once, because you're sure you need that performance gain, then let's do it like this so that you don't have a memory issue.

public static int GetMaxForFile(int i)
{
    string path = GetPath(i);

    var lines = new List<string>(File.ReadAllLines(path));

    // you MUST perform all of your processing here ... you have to let go
    // of the List<string> variable before you return ...
    int max = lines.Max(t => int.Parse(t));

    // this is probably redundant, but it drops the references so the list
    // becomes eligible for garbage collection right away
    lines.Clear();
    lines = null;

    return max;
}
Mike Perrenoud