
This question is a follow-up to my previous question regarding binary search (Fast, in-memory range lookup against +5M record table).

I have a sequential text file with over 5M records/lines, in the format below. I need to load it into a Range<int>[] array. How would one do that in a timely fashion?

File format:

start int64,end int64,result int
start int64,end int64,result int
start int64,end int64,result int
start int64,end int64,result int
...

2 Answers


This is a typical (?) producer-consumer problem, which can be solved using multiple threads. In your case the producer reads data from disk and the consumer parses the lines and populates the array. I can see two different cases:

  • Producer is (much) faster than the consumer: in this case you should try using more consumer threads;
  • Consumer is (much) faster than the producer: there isn't much you can do to speed things up, short of changing your hardware configuration, e.g. buying a faster hard disk or using RAID 0. In this case I wouldn't even use a multithreaded solution, because it's not worth the added complexity.

This question might help you implement that in C#.
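
A minimal sketch of that pattern using BlockingCollection could look like the following. It is only an illustration (not necessarily how the linked question does it); it assumes the Range<int> type from the question with a (start, end, result) constructor, and the consumer count is something you would tune for your hardware:

using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

static class RangeLoader
{
    public static Range<int>[] Load(string filepath, int consumerCount)
    {
        // Bounded queue so the producer cannot run arbitrarily far ahead of the consumers.
        var lines = new BlockingCollection<string>(boundedCapacity: 10000);
        var results = new ConcurrentBag<Range<int>>();

        // Producer: read lines from disk and hand them off for parsing.
        var producer = Task.Run(() =>
        {
            foreach (var line in File.ReadLines(filepath))
                lines.Add(line);
            lines.CompleteAdding();
        });

        // Consumers: parse queued lines into Range<int> objects.
        var consumers = new Task[consumerCount];
        for (var c = 0; c < consumerCount; c++)
        {
            consumers[c] = Task.Run(() =>
            {
                foreach (var line in lines.GetConsumingEnumerable())
                {
                    var parts = line.Split(',');
                    results.Add(new Range<int>(
                        long.Parse(parts[0]), long.Parse(parts[1]), int.Parse(parts[2])));
                }
            });
        }

        producer.Wait();
        Task.WaitAll(consumers);

        // ConcurrentBag does not preserve input order; sort afterwards if you need it.
        return results.ToArray();
    }
}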


I'm going to assume you have a good disk. Scan through the file once and count the number of entries. If you can guarantee your file has no blank lines, then you can just count the number of newlines in it -- don't actually parse each line.

Now you can allocate your array once with exactly that many entries. This avoids excessive re-allocations of the array:

var numEntries = File.ReadLines(filepath).Count();
var result = new Range<int>[numEntries];

Now read the file again and create your range objects with code something like:

var i = 0;
foreach (var line in File.ReadLines(filepath))
{
   var parts = line.Split(',');
   result[i++] = new Range<int>(long.Parse(parts[0]), long.Parse(parts[1]), int.Parse(parts[2]));
}

return result;

Sprinkle in some error handling as desired. This code is easy to understand. Try it out in your target environment; if it is too slow, you can start optimizing it. I wouldn't optimize prematurely, though, because that leads to much more complex code that might not be needed.
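
As one possible way to add that error handling (again assuming the same Range<int> constructor), you could switch to TryParse and skip malformed lines instead of throwing:

var i = 0;
foreach (var line in File.ReadLines(filepath))
{
    var parts = line.Split(',');
    if (parts.Length != 3 ||
        !long.TryParse(parts[0], out var start) ||
        !long.TryParse(parts[1], out var end) ||
        !int.TryParse(parts[2], out var value))
    {
        // Malformed line: skip it (or log/throw, depending on your requirements).
        // Note that skipped lines leave unused slots at the end of the pre-sized array.
        continue;
    }
    result[i++] = new Range<int>(start, end, value);
}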

  • You're going to end up spending way more time counting the lines at the start than you are just using a `List` and dealing with the re-allocated internal buffers. Disk IO is a lot more expensive than copying data already in memory, as long as you aren't dealing with data sets so large that they won't fit into memory (which is unlikely). – Servy Mar 07 '13 at 17:26
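
For reference, the single-pass List approach described in this comment might look roughly like this (same assumed Range<int> type):

var result = new List<Range<int>>();
foreach (var line in File.ReadLines(filepath))
{
    var parts = line.Split(',');
    // List<T> grows its internal buffer as needed, so the file is only read once.
    result.Add(new Range<int>(long.Parse(parts[0]), long.Parse(parts[1]), int.Parse(parts[2])));
}
return result.ToArray(); // only if an array is specifically required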