
I have a huge line-separated text file and I want to make some calculations on each line. Processing a line takes considerably longer than reading it, so I need a multithreaded program to handle the work (the bottleneck lies in the CPU processing rather than the I/O).

There are two options I came up with:

1) Open the file from the main thread, create a lock on the file handle, pass the file handle around to the worker threads, and then let each worker read-access the file directly

2) Create a producer/consumer setup where only the main thread has direct read access to the file and feeds lines to each worker thread through a shared queue

Things to know:

  • I am really interested in raw speed for this task
  • Each line is independent of the others
  • I am working in C++, but I guess the issue here is somewhat language-independent

Which option would you choose and why?

Alexandros
  • how many processors will you use, and how big is the file? – amit Feb 26 '12 at 12:52
  • the file is around 20 GB and in future implementations it will be even bigger. Currently I am running on 4 cores – Alexandros Feb 26 '12 at 12:54
  • @Alexandros: I know I am pretty late to answer :). But wouldn't assigning a block of lines to each thread be much easier? You can precalculate the block offsets for each thread using a single file pointer, and later each thread can open the file and seek to its precalculated position. I think this would be an easier and faster approach (see the sketch just after these comments) – Arunmu Feb 26 '12 at 14:09
  • This sounds like a pretty smart idea, thanks! The only thing is that multiple threads reading distant parts of the same file might create some disk-seek overhead, but I don't think that will be too much. – Alexandros Feb 27 '12 at 13:16
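For reference, a minimal sketch of that block-partitioning idea (illustrative only: the file name is hypothetical, error handling is omitted, and `process` is a placeholder for the per-line work). The convention here is that each thread owns the lines that start inside its byte range, so a line crossing a boundary is handled exactly once:

```cpp
#include <fstream>
#include <string>
#include <thread>
#include <vector>

// Each thread opens its own stream and processes the lines that *start*
// inside its byte range [begin, end). A line crossing the range boundary
// belongs to the range it starts in.
void process_block(const std::string& path, std::streamoff begin, std::streamoff end) {
    std::ifstream in(path, std::ios::binary);
    if (begin != 0) {
        // Back up one byte and discard up to the next newline, so we
        // start exactly on a line boundary and never read a line twice.
        in.seekg(begin - 1);
        std::string skipped;
        std::getline(in, skipped);
    }
    std::string line;
    while (in.tellg() < end && std::getline(in, line)) {
        // process(line);  // the CPU-heavy per-line work goes here
    }
}

int main() {
    const std::string path = "huge.txt";  // hypothetical file name
    std::ifstream probe(path, std::ios::binary | std::ios::ate);
    const std::streamoff size = probe.tellg();

    const unsigned n = std::thread::hardware_concurrency();
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < n; ++i)
        threads.emplace_back(process_block, path,
                             size * i / n, size * (i + 1) / n);
    for (auto& t : threads) t.join();
}
```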

5 Answers


I would suggest the second option, since it is cleaner design-wise and less complicated than the first. The first option is less scalable and requires additional communication among the threads to synchronize their progress through the file's lines. In the second option you have one dispatcher that deals with the I/O and starts the worker threads on their computation, and each computational thread is completely independent of the others, which allows you to scale. Moreover, the second option separates your logic more clearly.
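A minimal sketch of this dispatcher/worker design, assuming C++11 threads; the file name and the `process` hook are placeholders:

```cpp
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// One producer (the main thread) reads lines; N consumers process them.
class LineQueue {
    std::queue<std::string> lines_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
public:
    void push(std::string line) {
        { std::lock_guard<std::mutex> lk(m_); lines_.push(std::move(line)); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
    }
    // Blocks until a line is available; returns false once the queue
    // is drained and the producer has called close().
    bool pop(std::string& line) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !lines_.empty() || done_; });
        if (lines_.empty()) return false;
        line = std::move(lines_.front());
        lines_.pop();
        return true;
    }
};

void worker(LineQueue& q) {
    std::string line;
    while (q.pop(line)) {
        // process(line);  // the CPU-heavy per-line work goes here
    }
}

int main() {
    LineQueue q;
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < std::thread::hardware_concurrency(); ++i)
        workers.emplace_back(worker, std::ref(q));

    std::ifstream in("huge.txt");  // hypothetical file name
    std::string line;
    while (std::getline(in, line))
        q.push(line);              // dispatcher: feed the shared queue
    q.close();

    for (auto& t : workers) t.join();
}
```

Note that this queue is unbounded; since the workers are slower than the reader here, a production version should bound the queue so the reader blocks when the workers fall behind, which is essentially the pool of line buffers Martin James suggests in the comment below.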

Artem Barger
  • +1 for the P-C queue. I would suggest a class for the inter-thread comms that buffers some useful number of lines, so that each processing thread spends most of its time actually processing. I would flow-control this system by creating a pool of these line-buffer objects at startup (i.e. another P-C queue loaded up with them). – Martin James Feb 26 '12 at 13:28

If we are talking about a massively large file that needs to be processed on a large cluster, MapReduce is probably the best solution.

The framework gives you great scalability, and it already handles all the dirty work of managing the workers and tolerating failures for you.
The framework is specifically designed to receive files read from a file system [originally GFS] as input.

Note that there is an open-source implementation of MapReduce: Apache Hadoop

amit
  • This is not necessarily the right case for MapReduce. What if there is no actual reduce notion in his case? – Artem Barger Feb 26 '12 at 13:01
  • @ArtemBarger: map-reduce is often used with the identity function as the reduce step. A good example is a map-reduce based sort. – amit Feb 26 '12 at 13:02
  • I know that, but the question was: what if Alexandros's use case doesn't fit this notion? – Artem Barger Feb 26 '12 at 13:03

If each line is really independent and the processing is much slower than reading the file, what you can do is read all the data at once and store it in an array, such that each line is one element of the array.

Then all your threads can do the processing in parallel. For example, if you have 200 lines and 4 threads, each thread performs the calculation on 50 lines. Moreover, since this method is embarrassingly parallel, you could easily use OpenMP for it.
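A sketch of how that might look (illustrative only; the file name is hypothetical, and holding every line in memory assumes the file fits in RAM, which a 20 GB input may not):

```cpp
#include <fstream>
#include <string>
#include <vector>

int main() {
    // Read the whole file into memory, one vector element per line.
    std::ifstream in("huge.txt");  // hypothetical file name
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line))
        lines.push_back(line);

    // Each iteration is independent, so OpenMP can divide the loop
    // among threads (compile with -fopenmp or your compiler's flag).
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(lines.size()); ++i) {
        // process(lines[i]);  // the CPU-heavy per-line work goes here
    }
}
```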

MetallicPriest

I would suggest the second option because it is definitely better design-wise and would give you better control over the work your worker threads are doing.

Moreover, it would increase performance, since the inter-thread communication in that case is the minimum of the two options you described.

Lefteris
  • Since when is copy-pasting previous answers counted as a correct answer? – Artem Barger Feb 26 '12 at 13:02
  • @ArtemBarger I did not see your answer before I posted mine; I just typed in my opinion while working on something else, hence I was really slow. The OP did well to accept your answer as more complete, faster, and generally better, but there is no reason to accuse people of copy-pasting, nor to downvote for that sole reason – Lefteris Feb 26 '12 at 13:11
  • I'm really sorry, but your first sentence matches mine almost completely, and your posting time differs from mine by 10 minutes. As far as I know, SO warns you about new answers that arrive while you are writing, hence at first sight it looked very weird. – Artem Barger Feb 26 '12 at 14:03
  • @ArtemBarger It's okay, and I should have pressed the update button to see that someone else had already posted a better answer, but I just didn't. It's largely my fault, but I wanted to make clear to you that I did not copy-paste and had no reason to; hence I don't think I deserved the downvote. – Lefteris Feb 26 '12 at 14:11
  • OK, I understand and I'm sorry again. – Artem Barger Feb 27 '12 at 07:10

Another option is to memory-map the file and maintain a shared structure that properly handles mutual exclusion among the threads.
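One common variant of this maps the file read-only and gives each thread a disjoint byte range to scan, which avoids the shared structure entirely. A rough sketch assuming POSIX (error handling omitted, file name hypothetical, `process` a placeholder):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstring>
#include <thread>
#include <vector>

// Scan the lines that start inside [begin, end). If we land mid-line,
// the previous range owns that line, so we skip to the next newline.
static void scan_range(const char* data, size_t begin, size_t end, size_t size) {
    size_t pos = begin;
    if (pos != 0 && data[pos - 1] != '\n') {
        while (pos < size && data[pos] != '\n') ++pos;
        ++pos;
    }
    while (pos < end && pos < size) {
        const char* nl = static_cast<const char*>(
            std::memchr(data + pos, '\n', size - pos));
        size_t len = nl ? static_cast<size_t>(nl - (data + pos)) : size - pos;
        // process(data + pos, len);  // the CPU-heavy per-line work
        pos += len + 1;
    }
}

int main() {
    int fd = open("huge.txt", O_RDONLY);  // hypothetical file name
    struct stat st;
    fstat(fd, &st);                       // error handling omitted
    const size_t size = static_cast<size_t>(st.st_size);
    const char* data = static_cast<const char*>(
        mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0));

    const unsigned n = std::thread::hardware_concurrency();
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < n; ++i)
        threads.emplace_back(scan_range, data, size * i / n,
                             size * (i + 1) / n, size);
    for (auto& t : threads) t.join();

    munmap(const_cast<char*>(data), size);
    close(fd);
}
```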

Patrick Schlüter