4

I feel embarrassed to ask this question as I feel like I should already know the answer. However, given that I don't, I want to know how to read large files from disk into a database without getting an OutOfMemory exception. Specifically, I need to load CSV (or, really, tab-delimited) files.

I am experimenting with CSVReader, and specifically with this code sample, but I'm sure I'm doing it wrong. Some of their other code samples show how you can read streaming files of any size, which is pretty much what I want (only I need to read from disk), but I don't know what type of IDataReader I could create to allow this.

I am reading directly from disk, and my attempt to ensure I never run out of memory by reading too much data at once is below. I can't help thinking that I should be able to use a BufferedFileReader or something similar, where I can point to the location of the file and specify a buffer size; since CsvDataReader takes a TextReader as its first parameter, it could just use that. Please show me the error of my ways, rid me of my GetData method with its arbitrary file-chunking mechanism, and help me out with this basic problem.

    private void button3_Click(object sender, EventArgs e)
    {   
        totalNumberOfLinesInFile = GetNumberOfRecordsInFile();
        totalNumberOfLinesProcessed = 0; 

        while (totalNumberOfLinesProcessed < totalNumberOfLinesInFile)
        {
            TextReader tr = GetData();
            using (CsvDataReader csvData = new CsvDataReader(tr, '\t'))
            {
                csvData.Settings.HasHeaders = false;
                csvData.Settings.SkipEmptyRecords = true;
                csvData.Settings.TrimWhitespace = true;

                for (int i = 0; i < 30; i++) // known number of columns for testing purposes
                {
                    csvData.Columns.Add("varchar");
                }

                using (SqlBulkCopy bulkCopy = new SqlBulkCopy(@"Data Source=XPDEVVM\XPDEV;Initial Catalog=MyTest;Integrated Security=SSPI;"))
                {
                    bulkCopy.DestinationTableName = "work.test";

                    for (int i = 0; i < 30; i++)
                    {
                        bulkCopy.ColumnMappings.Add(i, i); // map First to first_name
                    }

                    bulkCopy.WriteToServer(csvData);

                }
            }
        }
    }

    private TextReader GetData()
    {
        StringBuilder result = new StringBuilder();
        int totalDataLines = 0;
        using (FileStream fs = new FileStream(pathToFile, FileMode.Open, System.IO.FileAccess.Read, FileShare.ReadWrite))
        {
            using (StreamReader sr = new StreamReader(fs))
            {
                string line = string.Empty;
                while ((line = sr.ReadLine()) != null)
                {
                    if (line.StartsWith("D\t"))
                    {
                        totalDataLines++;
                        if (totalDataLines < 100000) // Arbitrary method of restricting how much data is read at once.
                        {
                            result.AppendLine(line);
                        }
                    }
                }
            }
        }
        totalNumberOfLinesProcessed += totalDataLines;
        return new StringReader(result.ToString());
    }
Adrian
Mr Moose
  • This is what virtual memory is for. The real problem is address space. – spraff Jan 24 '12 at 14:09
  • Have you tried using the FileHelpers class for parsing CSV data? http://www.filehelpers.com/ – Kane Jan 27 '12 at 22:59
  • @Kane, No I wasn't aware of it. It looks OK though. My files have some meta data at the beginning, so it'd need to cater for that. I'll take a look at some of the site and see if it might be of some use. Thanks. – Mr Moose Jan 28 '12 at 14:10

6 Answers

4

Actually, your code reads all of the data from the file and keeps it in a TextReader (in memory). Then you read the data from the TextReader to save it to the server.

If the data is very big, the amount of data held in the TextReader causes the out-of-memory error. Please try it this way:

1) Read the data (each line) from the file.

2) Then insert each line into the server.

The out-of-memory problem will be solved because only one record is in memory while processing.

Pseudo code

begin tran

While (data = FileReader.ReadLine())
{
  insert into Table[col0,col1,etc] values (data[0], data[1], etc)
}

end tran
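
A minimal C# sketch of this line-by-line approach, using the file and connection string from the question; the target column names (col0, col1) and the number of fields pulled from each line are placeholders:

// Requires System.Data.SqlClient and System.IO.
// One parameterized INSERT per data line, all inside a single transaction.
using (SqlConnection conn = new SqlConnection(@"Data Source=XPDEVVM\XPDEV;Initial Catalog=MyTest;Integrated Security=SSPI;"))
{
    conn.Open();
    using (SqlTransaction tran = conn.BeginTransaction())
    {
        using (StreamReader sr = new StreamReader(pathToFile))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                if (!line.StartsWith("D\t")) continue; // only the data rows from the question's file format

                string[] fields = line.Split('\t');
                using (SqlCommand cmd = new SqlCommand(
                    "INSERT INTO work.test (col0, col1) VALUES (@p0, @p1)", conn, tran))
                {
                    // placeholder columns; reusing a single prepared command would be faster
                    cmd.Parameters.AddWithValue("@p0", fields[1]);
                    cmd.Parameters.AddWithValue("@p1", fields[2]);
                    cmd.ExecuteNonQuery(); // only one record is in memory at a time
                }
            }
        }
        tran.Commit();
    }
}

As the comments below point out, this trades the memory problem for row-by-row insert performance.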
deft_code
Thit Lwin Oo
  • While I appreciate the answer, it isn't quite what I am looking for. The code in my example is a modification. My original code read the entire file to the end (causing the OOM exception). The purpose of my question is to avoid Row By Agonising Row (RBAR) inserts and utilise the benefits of SqlBulkCopy to quickly load the data in. Your solution will certainly solve my OOM problem, but it will introduce performance issues that won't please my users :) – Mr Moose Feb 03 '12 at 01:59
  • You are right; it will make one record per transaction and block while inserting. In that case, I use a batch of records per transaction. I mean, I insert 100 records (for example) per transaction by using XML format in the SQL database. – Thit Lwin Oo Feb 03 '12 at 03:47
3

Probably not the answer you're looking for but this is what BULK INSERT was designed for.
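
For example, something along these lines (a hedged sketch: the file path and table are placeholders, the file must be readable by the SQL Server service account, and the metadata rows discussed in the comments below would still need to be dealt with):

// Requires System.Data.SqlClient. Runs a server-side bulk load; no file data
// passes through the client application's memory at all.
string sql = @"BULK INSERT work.test
               FROM 'C:\data\import.txt'
               WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n')";

using (SqlConnection conn = new SqlConnection(connectionString)) // placeholder connection string
using (SqlCommand cmd = new SqlCommand(sql, conn))
{
    cmd.CommandTimeout = 0; // large loads can exceed the default 30-second timeout
    conn.Open();
    cmd.ExecuteNonQuery();
}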

MyItchyChin
  • Yeah I agree. The problem I have is, not every row in the file is to be processed. The file contains a lot of meta data headers at the top, and only certain rows (beginning with a D\t) are to be processed. I'm not sure BulkInsert can help me with this. – Mr Moose Feb 03 '12 at 02:00
  • @MrMoose - You're probably better off writing a preprocessor to cycle through your CSV files and prep them for insert. – MyItchyChin Feb 03 '12 at 03:08
  • What would you suggest for the preprocessor? Extract the data rows to another file completely and then load that using SqlBulkCopy/BulkInsert? Or are you suggesting that the preprocessor prepares the data in memory. That is what I am trying to achieve above, but it seems like a less than ideal method of breaking up the files into manageable sized chunks so each chunk can be handled without risking an out of memory exception. – Mr Moose Feb 03 '12 at 04:51
  • @MrMoose - Try to formulate your solution based on external constraints like time, environment, and performance before imposing artificial constraints based on desire or preference. If you're looking for the smallest memory footprint, then creating a stream reader and a stream writer, reading one line from your file at a time, filtering it, writing it back out to a new file, and then using native bulk upload techniques on the new processed file will yield the smallest memory footprint. If you've got to do it all in memory, then I'd say skip doing bulk inserts and do it one row at a time. – MyItchyChin Feb 03 '12 at 14:15
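
A rough sketch of the preprocessing idea described in the comment above, assuming the filtered file is then loaded with BULK INSERT or SqlBulkCopy (the output path is a placeholder):

// Requires System.IO. Streams the raw file once, keeps only the "D\t" data rows,
// and writes them to a temporary file; only one line is ever held in memory.
string filteredPath = Path.Combine(Path.GetTempPath(), "filtered.txt"); // placeholder location

using (StreamReader reader = new StreamReader(pathToFile))
using (StreamWriter writer = new StreamWriter(filteredPath))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        if (line.StartsWith("D\t"))
        {
            writer.WriteLine(line);
        }
    }
}

// filteredPath can now be handed to BULK INSERT (as sketched above) or streamed
// through CsvDataReader and SqlBulkCopy without any chunking.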
1

I would just suggest using a BufferedFileReader with the readLine method and doing it exactly in the fashion above.

Basically it comes down to understanding the responsibilities here:

BufferedFileReader is the class that reads data from the file (buffer-wise); there should be a LineReader too. CSVReader is a utility class for reading the data, assuming that it's in the correct format.

SqlBulkCopy you are using anyway.

Second option:

You can go to the import facility of the database directly. If the format of the file is correct, and the whole point of the program is only this, that would be faster too.
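
If the CsvDataReader from the question consumes its underlying TextReader incrementally (an assumption on my part, not something I've verified), a buffered StreamReader can be handed to it directly and SqlBulkCopy will pull rows one at a time, so the whole file never has to sit in memory. A sketch reusing the question's settings, assuming the metadata rows have already been filtered out (e.g. by a preprocessor as discussed above):

// Requires System.Data.SqlClient and System.IO.
using (StreamReader sr = new StreamReader(pathToFile)) // StreamReader reads from the file buffer-wise
using (CsvDataReader csvData = new CsvDataReader(sr, '\t'))
{
    csvData.Settings.HasHeaders = false;
    csvData.Settings.SkipEmptyRecords = true;
    csvData.Settings.TrimWhitespace = true;

    for (int i = 0; i < 30; i++) // known number of columns, as in the question
    {
        csvData.Columns.Add("varchar");
    }

    using (SqlBulkCopy bulkCopy = new SqlBulkCopy(@"Data Source=XPDEVVM\XPDEV;Initial Catalog=MyTest;Integrated Security=SSPI;"))
    {
        bulkCopy.DestinationTableName = "work.test";
        bulkCopy.BulkCopyTimeout = 0;        // large files can exceed the default timeout
        bulkCopy.WriteToServer(csvData);     // streams row by row from the reader
    }
}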

manocha_ak
1

I think you may have a red herring with the size of the data. Every time I come across this problem, it's not the size of the data but the amount of objects created when looping over the data.

Look at your while loop that adds records to the db within the button3_Click(object sender, EventArgs e) method:

TextReader tr = GetData();
using (CsvDataReader csvData = new CsvDataReader(tr, '\t'))

Here you declare and instantiate two objects each iteration - meaning for each chunk of file you read you will instantiate 200,000 objects; the garbage collector will not keep up.

Why not declare the objects outside of the while loop?

TextReader tr = null;
CsvDataReader csvData = null;

This way, the GC will stand half a chance. You could prove the difference by benchmarking the while loop; you will no doubt notice a huge performance degradation after you have created just a couple of thousand objects.

Sean Blaney
  • Where are you seeing that I'm creating 200,000 objects? Even if that were the case, I'm pretty confident that the garbage collector could keep up. My original problem was caused by reading a complete file into memory, and then also trying to bind that entire file to a datagrid. I'm now moving away from that given it obviously doesn't work for larger files, but I'm simply curious to find a mechanism of loading a large file into a database (without having to resort to processing row by row) and not have me run into memory issues. – Mr Moose Feb 03 '12 at 05:23
0

pseudo code:

while (!EOF) {
   while (chosenRecords.size() < WRITE_BUFFER_LIST_SIZE) {
      MyRecord record = chooseOrSkipRecord(file.readln());
      if (record != null) {
         chosenRecords.add(record)
      }
   }  
   insertRecords(chosenRecords) // <== writes data and clears the list
}

WRITE_BUFFER_LIST_SIZE is just a constant that you set... bigger means bigger batches and smaller means smaller batches. A size of 1 is RBAR :).

If your operation is big enough that failing partway through is a realistic possibility, or if failing partway through could cost someone a non-trivial amount of money, you probably want to also write to a second table the total number of records processed so far from the file (including the ones you skipped) as part of the same transaction so that you can pick up where you left off in the event of partial completion.
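
A hedged C# rendering of this batching idea, using a DataTable per batch and SqlBulkCopy; the batch size, column names, and progress-table idea are placeholders to adapt:

// Requires System.Data, System.Data.SqlClient and System.IO.
private void LoadInBatches(string connectionString)
{
    const int WriteBufferSize = 100000; // the arbitrary knob - tune it for your environment

    DataTable batch = new DataTable();
    for (int i = 0; i < 30; i++)        // known column count from the question
    {
        batch.Columns.Add("col" + i, typeof(string)); // placeholder column names
    }

    using (StreamReader sr = new StreamReader(pathToFile)) // pathToFile as in the question
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            if (!line.StartsWith("D\t")) continue;   // the choose-or-skip step

            batch.Rows.Add(line.Split('\t'));        // one buffered record
            if (batch.Rows.Count >= WriteBufferSize)
            {
                WriteBatch(connectionString, batch); // writes data
                batch.Clear();                       // and clears the buffer
            }
        }
    }
    if (batch.Rows.Count > 0)
    {
        WriteBatch(connectionString, batch);         // flush the final partial batch
    }
}

private void WriteBatch(string connectionString, DataTable batch)
{
    using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connectionString))
    {
        bulkCopy.DestinationTableName = "work.test";
        bulkCopy.WriteToServer(batch);
    }
}

If restartability matters, SqlBulkCopy has a constructor that accepts an external SqlTransaction, so each batch plus an update to a progress table could be committed together and a failed run resumed from the recorded position.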

Gus
  • This is essentially what I am doing. While it works, I kinda feel that there must be a better way also. It is the arbitrary size in my example that seems to make me feel like it is a substandard solution. For mine, I will base it on a maximum size I want to read at once. So if a file I am processing is, say, 100M, and I only want to read 10M at a time, and I have 1M rows in the file, then I'd perform a basic calc (i.e. 1M rows in 100M, therefore roughly 100k rows in 10M) and read the file in chunks of 100k rows. It'll work, but as I say, it seems messy and I feel there must be a better way. – Mr Moose Feb 03 '12 at 05:41
0

Instead of reading CSV rows one by one and inserting them into the db one by one, I suggest reading a chunk and inserting it into the database. Repeat this process until the entire file has been read.

You can buffer in memory, say, 1000 CSV rows at a time, then insert them into the database.

int MAX_BUFFERED = 1000;
int counter = 0;
List<List<String>> bufferedRows = new ArrayList<List<String>>();

while (scanner.hasNext()) {
  List<String> rowEntries = getData(scanner.nextLine());
  bufferedRows.add(rowEntries);
  counter++;

  if (counter == MAX_BUFFERED) {
    // INSERT INTO DATABASE:
    // append all contents to a string buffer and create your SQL INSERT statement
    bufferedRows.clear(); // remove the data so it can be GCed when GC kicks in
    counter = 0;          // start counting the next batch
  }
}
// don't forget to insert any remaining rows that didn't fill a complete batch
Adrian
  • Thanks Adrian, see my comment to @Gus below. That is essentially what I am doing (using SqlBulkCopy rather than a large INSERT statement), but I just wonder if there is a better way. How do I determine an optimum number of rows to read? Depends on so many factors such as size of row, size of available memory etc. – Mr Moose Feb 03 '12 at 05:45
  • You need to perform tests on the machine you're doing the reading on. See how it performs when reading half full memory, almost full memory, full memory, double memory. – Adrian Feb 03 '12 at 05:48
  • Hmmm, it is going to be deployed in a client server type scenario and will be run on various machines, each of a different configuration. It is unlikely I'll find a scenario that fits all. I'll have to choose something that I feel is optimal and then make it configurable from the UI. – Mr Moose Feb 03 '12 at 05:56
  • Probably read 100 lines and insert them. But you can get the free memory and, based on that, perhaps estimate how many rows to buffer based on some benchmark tests. – Adrian Feb 03 '12 at 06:28