
I have multiple 1.5 GB CSV files which contain billing information on multiple accounts for clients of a service provider. I am trying to split each large CSV file into smaller chunks for processing and formatting the data inside it.

I do not want to roll my own CSV parser, but this is something I haven't seen yet, so please correct me if I am wrong. The 1.5 GB files contain information in the following order: account information, account number, Bill Date, transactions, Ex GST, Inc GST, type, and other lines.

Note that Bill Date here means the date when the invoice was made, so occasionally we have more than two bill dates in the same CSV.

Bills are grouped by : Account Number > Bill Date > Transactions.

Some accounts have 10 lines of transaction details; some have over 300,000 lines. A large 1.5 GB CSV file contains around 8 million lines of data. I used UltraEdit before to cut and paste the data into smaller chunks, but this has become a very inefficient and time-consuming process.

I just want to load the large CSV file in my WinForm, click a button, and have it split the file into chunks of, say, no more than 250,000 lines. Some bills are actually bigger than 250,000 lines, in which case they should be kept in one piece; accounts should not be split across multiple files, since the data is already ordered. Also, I do not want accounts with multiple bill dates in the same chunk; in that case the splitter can create an additional split.

I already have a WinForms application (VS C# 2010) that automatically does the formatting of the CSV in the smaller files.

Is it actually possible to process these very large CSV files? I have been trying to load the large files, but the OutOfMemoryException is an annoyance since it crashes every time and I don't know how to fix it. I am open to suggestions.

Here is what I think I should be doing:

  • Load the large CSV file (this currently fails with an OutOfMemoryException). How do I solve this?
  • Group data by account name, bill date, and count the number of lines for each group.
  • Then create an array of integers.
  • Pass this array of integers to a file splitter process, which will take the array and write the blocks of data.

Any suggestions will be greatly appreciated.

Thanks.

  • Maybe you can load the CSV file line by line, then import each line into a local temporary DB (SQL Compact or SQLite, to run without a full-featured server), and finally perform the GROUP BY query on the DB? – Steve B Oct 28 '11 at 04:45

4 Answers


You can use CsvReader to stream through and parse the data, without needing to load it all into memory in one go.
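For example, a minimal sketch, assuming the LumenWorks CsvReader (the "fast CSV reader" library; if you use a different CsvReader the API will differ slightly) and guessing at the field positions:

```csharp
using System.IO;
using LumenWorks.Framework.IO.Csv;   // LumenWorks "fast CSV reader"

class StreamingReadExample
{
    static void ReadBilling(string path)
    {
        // false = assume the billing files have no header row
        using (var csv = new CsvReader(new StreamReader(path), false))
        {
            // ReadNextRecord() advances one row at a time, so only the
            // current record is ever held in memory.
            while (csv.ReadNextRecord())
            {
                string account = csv[0];    // field positions are guesses for this layout
                string billDate = csv[2];
                // ...decide here which output chunk this record belongs to...
            }
        }
    }
}
```

Because only the current record is materialised at any point, the 1.5 GB file never has to fit in memory.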

Chris Fulstow

Yeah, about that... running out of memory is going to happen with files that are HUGE. You need to take your situation seriously.

As with most problems, break everything into steps.

I have had a similar type of situation before (large data file in CSV format, need to process, etc).

What I did:

Make step 1 of your program suite (or whatever) something that merely cuts your huge file into many smaller files. I have broken 5 GB zipped-up, PGP-encrypted files (after decryption... that's another headache) into many smaller pieces. You can do something simple like numbering them sequentially (e.g. 001, 002, 003...).
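A minimal streaming splitter along those lines might look like this (the part-file naming and the line cap are placeholders, and it assumes one record per physical line):

```csharp
using System.IO;

static class FileSplitterSketch
{
    // Reads the big CSV one line at a time and writes sequentially numbered
    // part files, so nothing close to 1.5 GB is ever held in memory.
    public static void Split(string inputPath, string outputFolder, int maxLinesPerPart)
    {
        int partNumber = 1;
        int linesInPart = 0;
        StreamWriter writer = null;
        try
        {
            foreach (string line in File.ReadLines(inputPath))   // lazy, line-by-line
            {
                if (writer == null || linesInPart >= maxLinesPerPart)
                {
                    if (writer != null) writer.Dispose();
                    string partPath = Path.Combine(outputFolder,
                        string.Format("part{0:000}.csv", partNumber++));
                    writer = new StreamWriter(partPath);
                    linesInPart = 0;
                }
                writer.WriteLine(line);
                linesInPart++;
            }
        }
        finally
        {
            if (writer != null) writer.Dispose();
        }
    }
}
```

If you need to keep an account (or an account/bill-date group) in one piece, as the question asks, defer the rollover until the account-number field changes instead of cutting exactly at the line cap.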

Then make an app to do the INPUT processing. No real business logic here. I hate file I/O with a passion when it comes to business logic, and I love the warm fuzzy feeling of data being in a nice SQL Server DB. That's just me. I created a thread pool and had N threads (say 5; you decide how many your machine can handle) read those .csv part files you created.

Each thread reads one file. One-to-one relationship. Because it is file I/O, make sure you don't have too many running at the same time. Each thread does the same basic operation: reads in data, puts it in a basic structure for the DB (table format), does lots of inserts, then ends. I used LINQ to SQL because everything is strongly typed and whatnot, but to each their own. The better the DB design, the better off you'll be later when doing the logic.
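A rough sketch of that thread-per-part-file pattern, using LINQ to SQL as described (the BillingLine entity, its columns, and the connection string are made up for illustration; map them to your own schema):

```csharp
using System.Collections.Generic;
using System.Data.Linq;
using System.Data.Linq.Mapping;
using System.IO;
using System.Threading;

[Table(Name = "BillingLines")]
public class BillingLine            // hypothetical table; adjust to your schema
{
    [Column(IsPrimaryKey = true, IsDbGenerated = true)]
    public int Id { get; set; }

    [Column]
    public string AccountNumber { get; set; }

    [Column]
    public string BillDate { get; set; }

    [Column]
    public string RawLine { get; set; }
}

static class Loader
{
    const string ConnectionString = "...";   // your SQL Server / SQL CE connection string

    public static void LoadAll(IEnumerable<string> partFiles, int maxThreads)
    {
        using (var gate = new SemaphoreSlim(maxThreads))   // cap concurrent file readers
        {
            var threads = new List<Thread>();
            foreach (string file in partFiles)
            {
                gate.Wait();                               // wait for a free slot
                var t = new Thread(path =>
                {
                    try { LoadOneFile((string)path); }
                    finally { gate.Release(); }
                });
                t.Start(file);
                threads.Add(t);
            }
            foreach (var t in threads) t.Join();           // wait for everything to finish
        }
    }

    static void LoadOneFile(string path)
    {
        using (var db = new DataContext(ConnectionString))
        {
            var table = db.GetTable<BillingLine>();
            foreach (string line in File.ReadLines(path))
            {
                string[] fields = line.Split(',');         // naive split; see the CsvReader answer
                table.InsertOnSubmit(new BillingLine
                {
                    AccountNumber = fields[0],             // field positions are guesses
                    BillDate = fields[2],
                    RawLine = line
                });
            }
            db.SubmitChanges();                            // one submit per part file
        }
    }
}
```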

After all threads have finished executing, you have all the data from the original CSV in the database. Now you can do all your business logic and do whatever from there. Not the prettiest solution, but I was forced into developing that given my situation/data flow/size/requirements. You might go with something completely different. Just sharing I guess.

Issa Fram
  • Your solution is what I actually had in mind, but for me it's a chicken-and-egg scenario since I can't seem to split the large CSV file; it gives me an OutOfMemoryException at the outset. If I split the CSV file into random chunks with an external application I lose the sorting, and if I lose the sorting, it's going to be a mess. The billing solution is 30 years old and uses white space rather than sequences in the file to replicate good old-fashioned dot-matrix printer design. – Saidur Rahman Oct 28 '11 at 05:03
  • How large were the split files you were trying to make? I think I went with a max of 50 MB when I did it. I created a program that took the maximum split size and the input file as parameters; it did everything from there. – Issa Fram Oct 28 '11 at 05:06
  • File split successful. I used FXFisherman's CSV Splitter (split them into 250,000-line pieces). How do I insert them into SQL CE in order using the threads you mentioned? – Saidur Rahman Oct 28 '11 at 05:39
  • If order matters, you need to manage your own little queue of requests. If order doesn't matter, it can be a straight data dump and threads can grab available files to read as soon as possible. Remember that you can do file I/O (reading the file) a lot faster than the inserts into the database. I remember reading a certain amount (a buffer, up to a maximum) into memory and having a thread try to insert it sequentially into the DB; if the buffer filled up, I would stop reading the file (to avoid any memory issues). Basically, if order matters, just use one file/thread at a time. That will be easier. – Issa Fram Oct 28 '11 at 05:59

You can use an external sort. I suppose you'd have to do an initial pass through the file to identify proper line boundaries, as CSV records are probably not of a fixed length.

Hopefully there are some ready-made external sort implementations for .NET that you could use.
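For what it's worth, here is a rough illustration of the chunk-sort-merge idea (assuming each record is a single line and that the sort key is the first comma-separated field; both are assumptions about the layout):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class ExternalSortSketch
{
    static string Key(string line) { return line.Split(',')[0]; }   // assumed sort key

    public static void Sort(string inputPath, string outputPath, int linesPerChunk)
    {
        // Pass 1: read the big file in chunks that fit in memory, sort each
        // chunk, and write it to its own temporary file.
        var chunkFiles = new List<string>();
        var buffer = new List<string>(linesPerChunk);
        foreach (string line in File.ReadLines(inputPath))
        {
            buffer.Add(line);
            if (buffer.Count == linesPerChunk)
            {
                chunkFiles.Add(WriteSortedChunk(buffer));
                buffer.Clear();
            }
        }
        if (buffer.Count > 0) chunkFiles.Add(WriteSortedChunk(buffer));

        // Pass 2: k-way merge the sorted chunks into one sorted output file.
        var readers = chunkFiles.Select(f => new StreamReader(f)).ToList();
        try
        {
            var heads = readers.Select(r => r.ReadLine()).ToList();
            using (var output = new StreamWriter(outputPath))
            {
                while (true)
                {
                    int best = -1;   // index of the chunk whose current line sorts first
                    for (int i = 0; i < heads.Count; i++)
                    {
                        if (heads[i] == null) continue;
                        if (best == -1 || string.CompareOrdinal(Key(heads[i]), Key(heads[best])) < 0)
                            best = i;
                    }
                    if (best == -1) break;                   // every chunk is exhausted
                    output.WriteLine(heads[best]);
                    heads[best] = readers[best].ReadLine();  // advance that chunk
                }
            }
        }
        finally
        {
            foreach (var r in readers) r.Dispose();
            foreach (var f in chunkFiles) File.Delete(f);
        }
    }

    static string WriteSortedChunk(List<string> lines)
    {
        string tempPath = Path.GetTempFileName();
        lines.Sort((a, b) => string.CompareOrdinal(Key(a), Key(b)));
        File.WriteAllLines(tempPath, lines);
        return tempPath;
    }
}
```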

Vlad

There's a very useful class in the Microsoft.VisualBasic.FileIO namespace that I've used for dealing with CSV files: the TextFieldParser class.

It might not help with the large file size, but it's built-in and handles quoted and non-quoted fields (even if mixed in the same line). I've used it a couple of times in projects at work.

Despite the assembly name, it can be used with C#, in case you're wondering.
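A minimal C# example (add a reference to Microsoft.VisualBasic.dll; the delimiter and the per-record processing are placeholders):

```csharp
using Microsoft.VisualBasic.FileIO;

class TextFieldParserExample
{
    static void ReadBilling(string path)
    {
        using (var parser = new TextFieldParser(path))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;   // handles quoted and unquoted fields

            // Streams the file, so only the current record is in memory at once.
            while (!parser.EndOfData)
            {
                string[] fields = parser.ReadFields();
                // ...process one record at a time here...
            }
        }
    }
}
```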

Tim