
I am experimenting with parallelism and LINQ. Look at the code below; it works, but it's just to show the idea:

private void LoadImages(string path)
{
    images =
        Directory.GetFiles(path)
        .Select(f => GetImage(f))
        .ToList();
}

private Image GetImage(string path)
{
    return Image.FromFile(path);
}

So I am basically loading an image from each file found in the specified directory. The question is: how do I make this parallel? Right now it iterates over the files one by one. I'd like to parallelize it somehow, but I'm too inexperienced to come up with an approach myself, so I'm asking you, guys, counting on some help to make this faster :)

ebvtrnog

2 Answers


Using PLINQ:

var images = (from file in Directory.EnumerateFiles(path).AsParallel()
              select GetImage(file)).ToList();

Reading the images isn't CPU bound, so you can specify a higher degree of parallelism:

var images = (from file in Directory.EnumerateFiles(path)
                                    .AsParallel()
                                    .WithDegreeOfParallelism(16)
              select GetImage(file)).ToList();
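
One caveat: by default `AsParallel` does not preserve source order, so the resulting list may come back shuffled relative to the file listing. If order matters, a sketch along the same lines using `AsOrdered`:

```csharp
// A sketch: AsOrdered makes PLINQ emit results in the original
// enumeration order, at a small cost in buffering.
var images = Directory.EnumerateFiles(path)
                      .AsParallel()
                      .AsOrdered()                    // keep the source order
                      .WithDegreeOfParallelism(16)
                      .Select(GetImage)
                      .ToList();
```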
Panagiotis Kanavos
  • I'm a bit confused by this answer. "Reading the images isn't CPU bound, so you can specify a higher degree of parallelism" How would this help? Disk access is synchronized anyway, right? So, using a higher degree of parallelism should not help unless I'm missing something. – Thash Aug 24 '18 at 11:18
  • 1
    @Thash even if disk access was synchronous (it isn't) there are *multiple* levels of caching at the disk, controller, OS level which means that the data you need may already be loaded in one of the caches. Disks batch IO commands too, to improve throughput. Finally, IO in Windows is *a*sychronous since the NT days. Synchronous API calls are *emulated* to make programming easier – Panagiotis Kanavos Sep 03 '18 at 08:37
  • 1
    @Thash that said, it doesn't mean that a DOP of 16 will result in 16x better performance. The actual performance will depend on the type of files, their size, etc. The aim is to use the disk's IO to its maximum. By reading multiple files in parallel the disk is busy loading one file while the OS handles the administrative overhead of finding and loading another one. That's why disk benchmarks use different tests for small and large files. Reading small files benefits from a high DOP while large files require a *smaller* one – Panagiotis Kanavos Sep 03 '18 at 08:47
  • thanks for clarifying! I've measured a couple of approaches. So far, I have been loading raw encoded image data synchronously and processing it on background threads using tasks. You are right that this way, it doesn't utilize the disk to its maximum. In my situation, the difference is not that big though (the processing usually takes as long as loading the file if not longer) so I think I'll leave it this way for now. – Thash Sep 03 '18 at 12:31

You could do something like

var images = new ConcurrentBag<Image>();

Parallel.ForEach(Directory.GetFiles(path), f => images.Add(GetImage(f)));
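
Note that `ConcurrentBag<T>` makes no ordering guarantees, so the images won't necessarily match the file order. If that matters, a sketch that writes into a pre-sized array by index instead (assuming the same `GetImage` helper from the question):

```csharp
// A sketch: each loop iteration owns exactly one array slot,
// so no concurrent collection is needed and order is preserved.
var files = Directory.GetFiles(path);
var images = new Image[files.Length];

Parallel.For(0, files.Length, i =>
{
    images[i] = GetImage(files[i]);
});
```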
Eric J.
  • @PanagiotisKanavos: I'm not in front of a compiler right now. Feel free to edit if you find a mistake. The method signature accepts an `IEnumerable` as the first parameter and an `Action` as the second parameter. – Eric J. Mar 05 '15 at 17:42