
This question is a follow-up to my previous question regarding binary search (Fast, in-memory range lookup against +5M record table).

I have a sequential text file with over 5M records/lines, in the format below. I need to load it into a Range<int>[] array. How would one do that in a timely fashion?

File format:

start int64,end int64,result int
start int64,end int64,result int
start int64,end int64,result int
start int64,end int64,result int
...

2 Answers


This is a typical (?) producer-consumer problem, which can be solved using multiple threads. In your case the producer reads data from disk and the consumer parses the lines and populates the array. I can see two different cases:

  • Producer is (much) faster than the consumer: in this case you should try using more consumer threads;
  • Consumer is (much) faster than the producer: there isn't much you can do to speed things up, short of changing your hardware configuration, e.g. buying a faster hard disk or using RAID 0. In this case I wouldn't even use a multithreaded solution, because it's not worth the added complexity.

This question might help you implement that in C#.
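
A minimal sketch of that pattern using BlockingCollection could look like the following. It is only an illustration (not necessarily how the linked question does it); it assumes the Range<int> type from the question with a (start, end, result) constructor, and the consumer count is something you would tune for your hardware:

using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

static class RangeLoader
{
    public static Range<int>[] Load(string filepath, int consumerCount)
    {
        // Bounded queue so the producer cannot run arbitrarily far ahead of the consumers.
        var lines = new BlockingCollection<string>(boundedCapacity: 10000);
        var results = new ConcurrentBag<Range<int>>();

        // Producer: read lines from disk and hand them off for parsing.
        var producer = Task.Run(() =>
        {
            foreach (var line in File.ReadLines(filepath))
                lines.Add(line);
            lines.CompleteAdding();
        });

        // Consumers: parse queued lines into Range<int> objects.
        var consumers = new Task[consumerCount];
        for (var c = 0; c < consumerCount; c++)
        {
            consumers[c] = Task.Run(() =>
            {
                foreach (var line in lines.GetConsumingEnumerable())
                {
                    var parts = line.Split(',');
                    results.Add(new Range<int>(
                        long.Parse(parts[0]), long.Parse(parts[1]), int.Parse(parts[2])));
                }
            });
        }

        producer.Wait();
        Task.WaitAll(consumers);

        // ConcurrentBag does not preserve input order; sort afterwards if you need it.
        return results.ToArray();
    }
}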


I'm going to assume you have a good disk. Scan through the file once and count the number of entries. If you can guarantee your file has no blank lines, then you can just count the number of newlines in it -- don't actually parse each line.

Now you can allocate your array once with exactly that many entries. This avoids excessive re-allocations of the array:

var numEntries = File.ReadLines(filepath).Count();
var result = new Range<int>[numEntries];

Now read the file again and create your range objects with code something like:

var i = 0;
foreach (var line in File.ReadLines(filepath))
{
   var parts = line.Split(',');
   result[i++] = new Range<int>(long.Parse(parts[0]), long.Parse(parts[1]), int.Parse(parts[2]));
}

return result;

Sprinkle in some error handling as desired. This code is easy to understand. Try it out in your target environment; if it is too slow, you can start optimizing it. I wouldn't optimize prematurely, though, because that leads to much more complex code that might not be needed.
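
As one possible way to add that error handling (again assuming the same Range<int> constructor), you could switch to TryParse and skip malformed lines instead of throwing:

var i = 0;
foreach (var line in File.ReadLines(filepath))
{
    var parts = line.Split(',');
    if (parts.Length != 3 ||
        !long.TryParse(parts[0], out var start) ||
        !long.TryParse(parts[1], out var end) ||
        !int.TryParse(parts[2], out var value))
    {
        // Malformed line: skip it (or log/throw, depending on your requirements).
        // Note that skipped lines leave unused slots at the end of the pre-sized array.
        continue;
    }
    result[i++] = new Range<int>(start, end, value);
}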

  • You're going to end up spending way more time counting the lines at the start than you are just using a `List` and dealing with the re-allocated internal buffers. Disk IO is a lot more expensive than copying data already in memory, as long as you aren't dealing with data sets so large that they won't fit into memory (which is unlikely). – Servy Mar 07 '13 at 17:26
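
For reference, the single-pass List approach described in this comment might look roughly like this (same assumed Range<int> type):

var result = new List<Range<int>>();
foreach (var line in File.ReadLines(filepath))
{
    var parts = line.Split(',');
    // List<T> grows its internal buffer as needed, so the file is only read once.
    result.Add(new Range<int>(long.Parse(parts[0]), long.Parse(parts[1]), int.Parse(parts[2])));
}
return result.ToArray(); // only if an array is specifically required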