8

I have a task which reads a large file line by line, does some logic with each line, and returns a string I need to write to a file. The order of the output does not matter. However, when I try the code below, it stalls or gets really slow after reading 15-20k lines of my file.

public static Object FileLock = new Object();
...
Parallel.ForEach(System.IO.File.ReadLines(inputFile), (line, _, lineNumber) =>
{
    var output = MyComplexMethodReturnsAString(line);
    lock (FileLock)
    {
        using (var file = System.IO.File.AppendText(outputFile))
        {
            file.WriteLine(output);
        }
    }
});

Why does my program slow down after running for some time? Is there a more correct way to perform this task?

justindao
  • Do you need the order of the output lines to correspond to the input order? If so, `Parallel.ForEach` isn't the right tool. – adv12 Feb 12 '16 at 22:37
  • No, the order of the output lines does not matter. – justindao Feb 12 '16 at 22:37
  • I'm not sure, but I feel that using parallel this way is creating/worsening the IO bottleneck instead of avoiding it, unless you are doing __really expensive__ operations on those lines. – TaW Feb 12 '16 at 22:41
  • It is a fairly expensive operation. There's a noticeable increase in speed for the first 5k lines at least. It just begins to slow down soon after that. – justindao Feb 12 '16 at 22:42
  • `var file = System.IO.File.AppendText(outputFile)` could be placed outside of the foreach, since you are locking it. Check if this improves the performance. – Fábio Junqueira Feb 12 '16 at 22:43
  • You have used a `lock` for synchronization to ensure only one thread can write to the file. This would definitely slow things down, as it serializes the write operation: multiple threads keep waiting until the file is written by the first thread. – Abhinav Galodha Feb 12 '16 at 22:44
  • Correct. Writing to the file is limited to one thread at a time. However, the operations done on each line are very expensive, so it would still be much faster (in theory) to perform all the operations on every line at once, and then write the results to the file one at a time. If you can offer a better solution, please let me know! – justindao Feb 12 '16 at 22:47
  • Your code is the equivalent of saying that you need to mow your lawn with a pair of scissors; however, rather than do it alone ('cause that'll take forever), you get 100 of your friends to help you, but then you say they all have to share the one pair of scissors. – Enigmativity Feb 13 '16 at 00:08

2 Answers

11

You've essentially serialized your query by having all threads try to write to the file. Instead, you should calculate what needs to be written in parallel, then write the results out as they come at the end.

var processedLines = File.ReadLines(inputFile).AsParallel()
    .Select(l => MyComplexMethodReturnsAString(l));
File.AppendAllLines(outputFile, processedLines);
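
Since order doesn't matter here, this is fine as-is; but if it ever did matter, you could ask PLINQ to preserve the input order (a small variant of the snippet above; note that `.AsOrdered()` adds some overhead):

var processedLines = File.ReadLines(inputFile).AsParallel().AsOrdered()
    .Select(l => MyComplexMethodReturnsAString(l));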

If you need to flush the data as it comes, open a stream and enable auto flushing (or flush manually):

var processedLines = File.ReadLines(inputFile).AsParallel()
    .Select(l => MyComplexMethodReturnsAString(l));
using (var output = File.AppendText(outputFile))
{
    // AutoFlush makes the StreamWriter flush after every WriteLine.
    output.AutoFlush = true;
    foreach (var processedLine in processedLines)
        output.WriteLine(processedLine);
}
Jeff Mercado
  • If the file is really big, I'm not sure this approach is adequate, since it requires reading the whole file as a first step. – Souhaieb Besbes Feb 17 '16 at 11:30
  • Not when using `File.ReadLines()`; it gives you an enumerable that enumerates the lines of a file as they're read. This is in contrast to `File.ReadAllLines()`, which returns an array containing all the lines of a file. _That_ reads in the whole file. – Jeff Mercado Feb 18 '16 at 21:11
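
To illustrate the distinction (a minimal sketch; `inputFile` stands in for any path):

// Lazy: File.ReadLines returns an IEnumerable<string> that yields
// lines one at a time as the file is read.
IEnumerable<string> lazy = File.ReadLines(inputFile);

// Eager: File.ReadAllLines reads the entire file into memory up front
// and returns all of its lines as a string[].
string[] eager = File.ReadAllLines(inputFile);
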
6

This has to do with how Parallel.ForEach's internal load balancer works. When it sees that your threads are spending a lot of time blocking, it reasons that it can speed things up by throwing more threads at the problem. This leads to higher parallel overheads, contention for your `FileLock`, and overall performance degradation.

Why is this happening? Because Parallel.ForEach is not meant for IO work.

How can you fix this? Use Parallel.ForEach for CPU work only and perform all IO outside of the parallel loop.
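
One way to structure that (a minimal sketch, assuming `MyComplexMethodReturnsAString` is the expensive CPU-bound step from the question) is to collect results in a thread-safe collection and write them once the loop completes:

var results = new System.Collections.Concurrent.ConcurrentBag<string>();

// CPU-bound work only inside the parallel loop.
Parallel.ForEach(System.IO.File.ReadLines(inputFile), line =>
{
    results.Add(MyComplexMethodReturnsAString(line));
});

// All IO happens on one thread, after the parallel work is done.
System.IO.File.AppendAllLines(outputFile, results);

Note that this buffers every result in memory before writing; the PLINQ approach in the other answer streams results out instead.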

A quick workaround is to limit the number of threads Parallel.ForEach is allowed to enlist, by using the overload which accepts ParallelOptions, like so:

Parallel.ForEach(
    System.IO.File.ReadLines(inputFile),
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    (line, _, lineNumber) =>
    {
        ...
    });
Kirill Shlenskiy
  • I really like your answer up until the "A quick workaround...". It really seems a backward step from everything you said before it. Perhaps if you fleshed out the code it would make more sense to me. – Enigmativity Feb 13 '16 at 00:03
  • Curious: I had assumed that Environment.ProcessorCount was the natural limit to MaxDegreeOfParallelism anyway. Is that wrong? – TaW Feb 13 '16 at 04:51
  • @TaW, no, it'll go way beyond `Environment.ProcessorCount`. Here's a fiddle which shows about 1 thread per second being added until you kill the process (I gave up after 100): https://dotnetfiddle.net/dT1eBM (needless to say, you probably shouldn't be running this on your production server) – Kirill Shlenskiy Feb 13 '16 at 05:28