I'm currently researching the best approach for processing a large file in C#. We have a file with 10 million+ lines of data. Originally, my client said the file would contain tens of thousands of lines, so we wrote each line out to a new file and had it picked up by our interface engine for processing. Now, however, these files are coming in much larger than expected, and processing takes an entire weekend.

I'm trying to optimize our logic and am researching the best way to go about it. I looked into having multiple threads read from a single file, but the mechanical bottleneck of disk I/O doesn't leave much room for improvement there. The next option is to read the file on a single thread and process each line (or group of lines) on separate worker threads. This should give us some speedup, since the processing of each line can be done concurrently; a rough sketch of the idea is below.

I know some people here have extensive experience processing very large files, and I was hoping to get some feedback on my approach, or perhaps some alternative ways to tackle this problem.
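Here's a minimal sketch of the single-reader / multi-worker idea, using `File.ReadLines` (which streams the file lazily rather than loading it all into memory) fanned out to worker threads via `Parallel.ForEach`. The path and `ProcessLine` body are placeholders for our actual work:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        // Placeholder path; in reality this comes from our file watcher.
        const string inputPath = @"C:\data\bigfile.txt";

        // File.ReadLines enumerates lines lazily on a single reader thread,
        // while Parallel.ForEach hands batches of lines to a pool of workers.
        Parallel.ForEach(
            File.ReadLines(inputPath),
            new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
            line =>
            {
                ProcessLine(line); // stand-in for the real per-line work
            });
    }

    static void ProcessLine(string line)
    {
        // Placeholder for the actual transformation each line needs.
    }
}
```

My understanding is that the default partitioner already hands lines to workers in chunks, which should keep the per-line synchronization overhead down, but I'd welcome corrections if that's wrong.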
Any thoughts and comments are appreciated.