29

I have a large text file with 100,000 lines. I need to start n threads and give every thread a unique line from this file.

What is the best way to do this? I think I need to read the file line by line, and the iterator must be global so I can lock it. Loading the text file into a list would be time-consuming, and I could get an OutOfMemoryException. Any ideas?

ItsPete
obdgy

6 Answers

43

You can use the File.ReadLines method to read the file line by line without loading the whole file into memory at once, and the Parallel.ForEach method to process the lines on multiple threads in parallel:

Parallel.ForEach(File.ReadLines("file.txt"), (line, _, lineNumber) =>
{
    // your code here
});
dtb
  • I agree. The only thing I want to add is that the ReadLines enumerable should be partitioned, because each parallel execution should be doing something reasonably heavy. – ozgur May 19 '16 at 12:47
  • Something to keep in mind: Parallel.ForEach will spawn a bunch of 'workers', then wait until *all* of them are done and only then spawn the next bunch of workers. So if processing time per line can differ, it would be advisable to use Jake Drew's approach (producer/consumer pattern) – Steffen Winkler Jul 26 '16 at 12:06
  • An OutOfMemoryException can be thrown here if the file is too large. – sinitram Jul 10 '17 at 13:44
  • https://dotnetfiddle.net/wX7VhA may be of interest @SteffenWinkler. Note that the 3rd item starts after the 1st ends - not after the 2nd ends. I am not convinced your bunching concern is valid. – mjwills Aug 24 '18 at 10:56
  • @mjwills Huh, after some further playing around/testing I have to agree with you. My initial observation must have been a coincidence, or I didn't pay enough attention to what was going on. However, one thing I would like to note is that Parallel.ForEach seems to divide the list of entries by the number of available threads, and each thread executes a sublist. So thread 1 gets entries 1-20 and thread 2 gets entries 21-40, instead of each thread just taking the next available entry. – Steffen Winkler Sep 12 '18 at 09:50
  • Yep, it does partition like that. https://stackoverflow.com/questions/5352491/ordered-plinq-forall/20929046#20929046 may be worth a read if you don't want that behaviour (see also the partitioner sketch after these comments). – mjwills Sep 12 '18 at 09:53
  • Also consider removing your earlier (incorrect) comments. – mjwills Sep 12 '18 at 09:58
  • Small correction that may lead to errors: the third parameter of the lambda (the body) would be better named "readingIndex". As a human, you number the first line 1, not 0, but as programmers we are used to the first element of an array having index 0 (this can differ, but nowadays that is pretty rare). – Master DJon Feb 17 '22 at 13:53
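For reference, a minimal sketch (not from the original answer; the file path and the loop body are placeholders) of how to opt out of that chunked partitioning: wrapping the lines in Partitioner.Create with EnumerablePartitionerOptions.NoBuffering makes each worker take one line at a time.

using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class NoBufferingExample
{
    static void Main()
    {
        // NoBuffering hands each worker a single line at a time instead of a
        // buffered chunk, which helps when per-line processing times vary a lot.
        var lines = Partitioner.Create(
            File.ReadLines("file.txt"),
            EnumerablePartitionerOptions.NoBuffering);

        Parallel.ForEach(lines, line =>
        {
            // your per-line work here
        });
    }
}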
24

After performing my own benchmarks for loading 61,277,203 lines into memory and shoving values into a Dictionary / ConcurrentDictionary, the results support @dtb's answer above that the following approach is the fastest:

Parallel.ForEach(File.ReadLines(catalogPath), line =>
{

}); 

My tests also showed the following:

  1. File.ReadAllLines() and File.ReadAllLines().AsParallel() appear to run at almost exactly the same speed on a file of this size. Looking at my CPU activity, they both appear to use only two of my 8 cores.
  2. Reading all the data first using File.ReadAllLines() appears to be much slower than using File.ReadLines() in a Parallel.ForEach() loop.
  3. I also tried a producer/consumer (MapReduce-style) pattern, where one thread read the data and a second thread processed it. This did not outperform the simple pattern above either.

I have included an example of this pattern for reference, since it is not included on this page:

var inputLines = new BlockingCollection<string>();
ConcurrentDictionary<int, int> catalog = new ConcurrentDictionary<int, int>();

var readLines = Task.Factory.StartNew(() =>
{
    foreach (var line in File.ReadLines(catalogPath))
        inputLines.Add(line);

    inputLines.CompleteAdding();
});

var processLines = Task.Factory.StartNew(() =>
{
    Parallel.ForEach(inputLines.GetConsumingEnumerable(), line =>
    {
        string[] lineFields = line.Split('\t');
        int genomicId = int.Parse(lineFields[3]);
        int taxId = int.Parse(lineFields[0]);
        catalog.TryAdd(genomicId, taxId);   
    });
});

Task.WaitAll(readLines, processLines);

Here are my benchmarks:

[benchmark results chart]

I suspect that under certain processing conditions, the producer / consumer pattern might outperform the simple Parallel.ForEach(File.ReadLines()) pattern. However, it did not in this situation.

Jake Drew
  • Hi Jake, thanks for sharing the benchmark. While I agree that it's essential to use `File.ReadLines()` to avoid large memory consumption, does `Parallel.ForEach(File.ReadLines())` really outperform single-threaded processing? A stream is sequential by design, and the hardware only supports reading one thing at a time, so using multiple threads to process the result may increase the overhead of blocking and context switching; it would be interesting to see the performance metrics of simply processing the result of `File.ReadLines()` using a single thread. – dragonfly02 May 15 '18 at 18:41
  • Go for it! It's easy enough to benchmark. To answer your question, I think it depends on how much processing your foreach loop performs on each line. Think of a scenario where the stream is providing work to the thread pool in the Parallel.ForEach(): each line that is read in is then passed to the next available thread for processing. If you have an empty foreach loop, then reading with a single thread might be faster. However, this is typically not the case. – Jake Drew May 15 '18 at 22:03
  • While the stream may be sequential by design, it is also exposed as an IEnumerable that uses yield return. As a result, in a single- or multi-threaded scenario, File.ReadLines() behaves the same, "yielding" processing until the next line is requested, whether from a single thread or multiple threads. It all really boils down to how much work you are doing on each line, as far as the speed-up (if any) you get from parallel processing! – Jake Drew May 15 '18 at 22:04
  • I noticed that the BlockingCollection is really slow. Perhaps this would be faster with a different backing store. – Walter Verhoeven Aug 30 '19 at 15:20
7

Read the file on one thread, adding its lines to a blocking queue. Start N tasks that read from that queue. Set a maximum size on the queue to prevent out-of-memory errors.
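For illustration, a minimal sketch of this approach using a bounded BlockingCollection (the capacity of 10000, the consumer count n, the file name, and ProcessLine are placeholder assumptions, not from the answer):

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class BoundedProducerConsumer
{
    static void ProcessLine(string line)
    {
        // placeholder for the real per-line work
    }

    static void Main()
    {
        // Bounded capacity: the producer blocks when the queue is full,
        // which keeps memory usage flat regardless of the file size.
        var queue = new BlockingCollection<string>(boundedCapacity: 10000);

        var producer = Task.Run(() =>
        {
            foreach (var line in File.ReadLines("file.txt"))
                queue.Add(line);
            queue.CompleteAdding();   // tell the consumers no more lines are coming
        });

        int n = 4;                    // number of consumer tasks
        var consumers = Enumerable.Range(0, n)
            .Select(_ => Task.Run(() =>
            {
                foreach (var line in queue.GetConsumingEnumerable())
                    ProcessLine(line);
            }))
            .ToArray();

        Task.WaitAll(consumers);
        producer.Wait();
    }
}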

Sergey Kalinichenko
5

Something like:

using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public class ParallelReadExample
{
    public static IEnumerable<string> LineGenerator(StreamReader sr)
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            yield return line;
        }
    }

    static void Main()
    {
        using (StreamReader sr = new StreamReader("yourfile.txt"))
        {
            Parallel.ForEach(LineGenerator(sr), currentLine =>
            {
                // Do your thing with currentLine here...
            });
        }
    }
}

Think it would work. (No C# compiler/IDE here)

Daan Timmer
  • What about rewriting it using thr = new Thread[j]; for (; i < j; i++) { thr[i] = new Thread(new ThreadStart(go)); thr[i].IsBackground = true; thr[i].Start(); } instead of Parallel.ForEach? – obdgy Jun 19 '13 at 10:19
  • @obdgy: Why would you want to do that? – dtb Jun 19 '13 at 10:24
  • @obdgy what use does that have compared to Parallel.ForEach? – Daan Timmer Jun 19 '13 at 10:26
  • @obdgy using 100-300 threads has no speed benefit if you are running on a dual-, quad- or octa-core machine. It might even be slower than running just 8 threads on an octa-core. Put simply: running more threads than CPU cores will only slow down the process. – Daan Timmer Jun 21 '13 at 09:30
4

If you want to limit the number of threads to n, the easiest way is to use AsParallel() along with WithDegreeOfParallelism(n) to limit the thread count, and ForAll() so the per-line work actually runs on those threads (a plain foreach over the parallel query would run the loop body sequentially on the calling thread):

string filename = "C:\\TEST\\TEST.DATA";
int n = 5;

File.ReadLines(filename)
    .AsParallel()
    .WithDegreeOfParallelism(n)
    .ForAll(line =>
    {
        // Process line.
    });
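A minimal alternative sketch (not part of this answer, and reusing the filename and n variables from the snippet above) that caps the thread count while keeping the Parallel.ForEach style from the accepted answer is ParallelOptions.MaxDegreeOfParallelism:

var options = new ParallelOptions { MaxDegreeOfParallelism = n };

Parallel.ForEach(File.ReadLines(filename), options, line =>
{
    // Process line.
});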
Matthew Watson
2

As @dtb mentioned above, the fastest way to read a file and then process its individual lines is to:

  1. Read the whole file into an array with File.ReadAllLines().
  2. Use a Parallel.For loop to iterate over the array.

You can read more performance benchmarks here.

The basic gist of the code you would have to write is:

string[] AllLines = File.ReadAllLines(fileName);
Parallel.For(0, AllLines.Length, x =>
{
    DoStuff(AllLines[x]);
    //whatever you need to do
});

With the support for larger arrays introduced in .NET 4.5 (via the gcAllowVeryLargeObjects setting), as long as you have plenty of memory, this shouldn't be an issue.