
So I have a large file with ~2 million lines. Reading the file is a bottleneck in my code. Any suggestions or expert opinions on reading the file faster are welcome. The order in which lines are read from the file is unimportant. All lines are pipe ('|') separated, fixed-length records.

What have I tried? I started parallel StreamReaders and made sure the resource was locked properly, but this approach failed: I now had multiple threads fighting to get hold of the single StreamReader and wasting more time on locking, which slowed the code down even further.

One intuitive approach is to split the file and then read the pieces, but I wish to leave the file intact and still somehow be able to read it faster.

displayName
  • Are you sure the bottleneck is the file reader and not the disk IO? – Mysticial Jul 11 '14 at 01:04
  • Have you measured the performance? My opinion is that the StreamReader will read with little overhead. Maybe you are reading small chunks of data. Try to read large blocks and perform the line splitting in memory. – Stefan Jul 11 '14 at 01:05
  • 2
    Agree with ^^^^, the multi threading would come into play processing the what is read, however you should have no problem reading the file with multiple readers (as long as you open it read only and shared)... – T McKeown Jul 11 '14 at 01:07
  • 1
    I don't think that *parallelize* file IO will help much. Most costly operation with disk IO is the disk head moving among tracks.... – EZI Jul 11 '14 at 01:07
  • @Mysticial: updated. It is 'file reading', not 'file reader'. @EZI: I get your point... so will I have to settle for this performance? – displayName Jul 11 '14 at 01:09
  • you can read a file with multiple readers... files are opened and read by multiple processes/threads all the time... – T McKeown Jul 11 '14 at 01:11
  • Are the "lines" fixed length records? If so, multiple threads could start at various points in the file and read forward. If the file is on a hard disk then the contention between threads would likely slow down the process. If the "lines" are variable length then it is a little harder to have a thread start in the middle. Moving the file to a faster device, e.g. suitable RAID set or SSD would help. Having a single thread that does all of the reading into a series of large buffers and letting other threads process the data is probably the best arrangement. – HABO Jul 11 '14 at 02:12
  • @HABO: Can you elaborate on how your idea gets around what EZI mentioned in the comments above? – displayName Jul 11 '14 at 04:29
  • If you are moving one set of physical heads on a single HDD then contention between threads trying to read different parts of the file simultaneously will cause thrashing. If, OTOH, you are reading stripes off a RAID set in parallel then you may increase the bandwidth. SSDs, lacking physical heads, also perform better in this type of application. And a RAID set of SSDs ... . – HABO Jul 11 '14 at 13:29
  • @HABO : Got it. I am looking for software based solutions. Not possible for me to change the underlying hardware. – displayName Jul 11 '14 at 13:43
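HABO's single-reader arrangement from the comments can be sketched in C# as one thread reading sequentially into a bounded queue while worker threads parse the records. This is only an illustration under assumptions: "records.txt" is a placeholder path, and the capacity of 10000 and worker count are arbitrary tuning choices.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class SingleReaderPipeline
{
    static void Main()
    {
        // Bounded queue so the reader cannot run far ahead of the parsers.
        var lines = new BlockingCollection<string>(boundedCapacity: 10000);

        // One thread reads sequentially -- the disk head never has to seek
        // between competing readers.
        var reader = Task.Run(() =>
        {
            foreach (var line in File.ReadLines("records.txt")) // placeholder path
                lines.Add(line);
            lines.CompleteAdding();
        });

        // Several workers split and process records in parallel; order is
        // unimportant, as stated in the question.
        var workers = new Task[Environment.ProcessorCount];
        for (int i = 0; i < workers.Length; i++)
        {
            workers[i] = Task.Run(() =>
            {
                foreach (var line in lines.GetConsumingEnumerable())
                {
                    string[] fields = line.Split('|');
                    // process fields here
                }
            });
        }

        Task.WaitAll(workers);
        reader.Wait();
    }
}
```

The key design point is that only the reader task ever touches the disk, so IO stays sequential; the CPU-bound splitting is what gets parallelized.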

2 Answers


I would try maximizing the buffer size. The default size is 1024; increasing it should improve performance. I would suggest experimenting with several buffer sizes to see which works best for you.

StreamReader(Stream, Encoding, Boolean, Int32) Initializes a new instance of the StreamReader class for the specified stream, with the specified character encoding, byte order mark detection option, and buffer size.
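As a sketch of that overload: the 64 KB buffer below and the "records.txt" path are placeholder assumptions — measure with your own file to find the best value.

```csharp
using System;
using System.IO;
using System.Text;

class BufferedRead
{
    static void Main()
    {
        const int bufferSize = 1 << 16; // 64 KB instead of the 1024-byte default

        using (var stream = File.OpenRead("records.txt")) // placeholder path
        using (var reader = new StreamReader(stream, Encoding.UTF8,
            detectEncodingFromByteOrderMarks: true, bufferSize: bufferSize))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string[] fields = line.Split('|'); // pipe-separated records
                // process fields here
            }
        }
    }
}
```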

etr
  • I suggest you do a benchmark comparison between all your options: StreamReader.ReadLine, File.ReadLines, File.ReadAllLines, and String.Split. Your environment might yield different results, but for me the StreamReader was the fastest. – etr Jul 11 '14 at 01:30
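A rough way to run the comparison suggested in the comment might look like the sketch below ("records.txt" is a placeholder path). Note that the OS file cache skews results: run each case more than once, since the first pass pays for the cold read.

```csharp
using System;
using System.Diagnostics;
using System.IO;

class ReadBenchmark
{
    static void Main()
    {
        const string path = "records.txt"; // placeholder path

        Time("StreamReader.ReadLine", () =>
        {
            using (var r = new StreamReader(path))
                while (r.ReadLine() != null) { }
        });

        Time("File.ReadLines", () =>
        {
            foreach (var line in File.ReadLines(path)) { } // lazy enumeration
        });

        Time("File.ReadAllLines", () =>
        {
            string[] all = File.ReadAllLines(path); // loads the whole file into memory
        });
    }

    static void Time(string name, Action body)
    {
        var sw = Stopwatch.StartNew();
        body();
        Console.WriteLine("{0}: {1} ms", name, sw.ElapsedMilliseconds);
    }
}
```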

I understand now that my problem is not related to software; it is a 'mechanical' problem. Unless it is possible to change the hardware, there is no way to improve the reading performance. Why is that? There is only one head to read from the disk, so even if I try to read the file from both ends, for example, it is that same head which will now have to move even more to serve the two threads. Hence it is wiser to let the reader read sequentially, and that is the maximum performance achievable.
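Given that sequential access is the best case, one small software-side tweak is to hint the OS that the file will be scanned sequentially, which can improve read-ahead. A minimal sketch, assuming a placeholder "records.txt" path and an arbitrary 64 KB buffer:

```csharp
using System;
using System.IO;
using System.Text;

class SequentialRead
{
    static void Main()
    {
        // FileOptions.SequentialScan tells the OS the file will be read
        // front to back, so it can prefetch aggressively.
        var stream = new FileStream("records.txt", FileMode.Open, FileAccess.Read,
            FileShare.Read, 1 << 16, FileOptions.SequentialScan);

        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string[] fields = line.Split('|');
                // process fields here
            }
        }
    }
}
```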

Thank you all for the explanations; they helped me understand this concept. It may be a very basic and straightforward point for most people here on Stack Overflow, but from this question I really learned something about file reading and hardware performance, and finally understood the things taught to me in college.

displayName