Given this PLINQ code:

    public static IEnumerable<Tuple<string, string>> PlinqFileProcessingLimitedCores(int nr_of_cores)
    {
        string archiveDirectory = @"C:\Dotnet46Examples";

        return (from file in Directory.EnumerateFiles(archiveDirectory, "*.cs", SearchOption.AllDirectories)
                from line in File.ReadLines(file).AsParallel().WithDegreeOfParallelism(nr_of_cores)
                where line.Contains("Console")
                select new Tuple<string, string>(file, line));
    }

which returns all lines of all files containing the word Console.

I tried to write faster async versions, but they all turned out to be significantly slower than PLINQ, e.g.:

    public static async Task<ConcurrentBag<Tuple<string, string>>> FileProcessingAsync()
    {
        string archiveDirectory = @"C:\Dotnet46Examples";
        var bag = new ConcurrentBag<Tuple<string, string>>();
        var tasks = Directory.EnumerateFiles(archiveDirectory, "*.cs", SearchOption.AllDirectories)
               .Select(file => ProcessFileAsync(bag, file));
        await Task.WhenAll(tasks);  
        return bag;
    }

    static async Task ProcessFileAsync(ConcurrentBag<Tuple<string, string>> bag, string file)
    {
        String line;
        using (StreamReader reader = File.OpenText(file))
        {
            while (reader.Peek() >= 0)
            {
                line = await reader.ReadLineAsync(); 
                if (line != null)
                {
                    if (line.Contains("Console"))
                    {
                        bag.Add(new Tuple<string, string>(file, line));
                    }
                }
            }        
        }
    }

Why is the async code so much slower (by a factor of 1000 on my laptop)? What would better code look like? Is this problem simply not suited to async? Thanks.

Herbert Feichtinger
  • Are you running this code in a console application or a WinForms application? I am asking because of the possible implications of an installed `SynchronizationContext`. – Theodor Zoulias Apr 27 '20 at 08:43
  • Profiling I/O-bound code correctly is not easy; the file system cache helps too much. But it certainly highlights the big design problem with TextReader.ReadLineAsync(): never ever use it. – Hans Passant Apr 27 '20 at 13:49
  • I tested in a console app. – Herbert Feichtinger Apr 27 '20 at 21:54

1 Answer

Your parallel example is (synchronously) reading the file into memory one line at a time and (in parallel) searching for the text. That's probably the fastest solution available, because often synchronous file I/O on Windows is faster than asynchronous.
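As an illustration of the "synchronous reads, parallel search" shape described above, here is a minimal sketch. It is not from the original post: the class name `ParallelFileSearch`, the method `FindLines`, and its parameters are hypothetical, and it parallelizes across files rather than across the lines of each file. Whether this is faster than the question's version would need to be measured on the actual data.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class ParallelFileSearch
{
    // Hypothetical helper: returns (file, line) pairs for every line containing `term`.
    public static List<Tuple<string, string>> FindLines(string directory, string term)
    {
        return Directory.EnumerateFiles(directory, "*.cs", SearchOption.AllDirectories)
            .AsParallel()                               // distribute files across worker threads
            .SelectMany(file => File.ReadLines(file)    // lazy, synchronous line-by-line read
                .Where(line => line.Contains(term))
                .Select(line => Tuple.Create(file, line)))
            .ToList();                                  // result order is not guaranteed
    }
}
```

Each worker thread blocks while it reads its file, which is the thread-usage cost of this approach; in a console application with nothing else to do, that cost is usually acceptable.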

> I tried to write faster async versions

"Asynchronous" does not mean "faster". It means "does not block the calling thread". There's additional overhead with asynchronous code, so it is generally slower. The benefit of asynchronous code is not speed; it's freeing up threads. This is only a benefit if those threads have other work to do; e.g., in a server environment they could handle other requests.

There's also the problem that methods like File.OpenText don't open the file for asynchronous access, so what ReadLineAsync actually does is run synchronous work on the thread pool and then treat it as asynchronous. But even with a correct asynchronous implementation, it wouldn't be faster than reading the file synchronously.
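For reference, a minimal sketch of what opening a file for asynchronous access could look like, assuming a `FileStream` constructed with `FileOptions.Asynchronous`. The class name `AsyncFileSearch`, the method `FindLinesAsync`, and the 4096-byte buffer size are hypothetical choices for illustration, not part of the original post. As stated above, this is not expected to beat the synchronous version on speed; it only avoids blocking the calling thread.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public static class AsyncFileSearch
{
    // Hypothetical helper: returns (file, line) pairs for lines of one file containing `term`.
    public static async Task<List<Tuple<string, string>>> FindLinesAsync(string file, string term)
    {
        var matches = new List<Tuple<string, string>>();

        // FileOptions.Asynchronous asks the OS for asynchronous (overlapped) I/O.
        // The buffer size of 4096 is an arbitrary value chosen for this sketch.
        using (var stream = new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.Read,
                                           4096, FileOptions.Asynchronous | FileOptions.SequentialScan))
        using (var reader = new StreamReader(stream))
        {
            string line;
            // ReadLineAsync returns null at end of file, so no Peek() call is needed.
            while ((line = await reader.ReadLineAsync()) != null)
            {
                if (line.Contains(term))
                {
                    matches.Add(Tuple.Create(file, line));
                }
            }
        }

        return matches;
    }
}
```

Returning a per-file list keeps the helper free of shared state; a caller could combine the results after `Task.WhenAll` instead of passing a `ConcurrentBag` into every task.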

Stephen Cleary
  • I do not think synchronous IO is faster than async IO, since all IO is async internally in Windows. A better explanation would be that the file is read in one go, rather than line by line. – JonasH Apr 27 '20 at 09:39
  • Yes, all I/O is asynchronous at the driver level. But both Win32 and .NET have synchronous implementations that require fewer allocations and less CPU usage. Even if the asynchronous code were fully fixed, it would still be slower. – Stephen Cleary Apr 27 '20 at 09:42
  • Stephen, the OP's code uses `File.ReadLines`, not `File.ReadAllLines`. – Theodor Zoulias Apr 27 '20 at 10:10
  • Thanks @TheodorZoulias, I did miss that! – Stephen Cleary Apr 27 '20 at 13:31
  • @Stephen Cleary thanks for the information, I did not know that. – JonasH Apr 27 '20 at 14:21
  • @Stephen Cleary: I misinterpreted your answer given in https://stackoverflow.com/questions/41126283/combining-plinq-with-async-method about PLINQ: "The second approach is more wasteful (in terms of thread usage), but is probably easier given what parts of your code we've seen. The second approach is to leave the I/O all synchronous and just do it in a regular PLINQ fashion" in that I thought this means async requires less elapsed time. – Herbert Feichtinger Apr 27 '20 at 21:51