How to parallelize an IEnumerable with a slow yield (which renders PLINQ useless)?

Question

I'm having some trouble finding a way to properly parallelize the processing of an IEnumerable, where the actual generation of each item takes a considerable amount of time, so it effectively locks for a bit every call to MoveNext on the reader side.

This is my scenario:

I have a method that takes an IEnumerable<(float[], float[])> (the specific type doesn't actually matter here), and I need to compute those items, split them into batches of a fixed size, then process every batch.

Assume I already have the partition code ready (see this answer here) as well as the code to process each individual partition.

The problem is that, as I've said, yielding every value from the initial sequence involves some IO/CPU operations (one would typically read an image, process it and return those two matrices), so even with:

var items = dataset.AsParallel().Partition(size).ToArray().AsParallel().Select(partition =>
{
    // Process the partitions here..
    return partition;
}).ToArray(); // Two AsParallel calls because I'm doing two selections one after the other

I get around 25% CPU usage (I have an 8-core AMD FX-8350). My guess is that the actual generation of the items in the source sequence is what makes the enumeration slow, before execution even gets past the first AsParallel call.

I was thinking a possible solution would be to require the user of this method to instead provide an IEnumerable<Func<(float[], float[])>>, as that would allow my method to easily process those elements in parallel.
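A minimal sketch of that idea, assuming the caller supplies cheap-to-enumerate factories: the delegates are invoked in parallel with PLINQ and the results are then grouped into fixed-size batches. The `LazyBatching` class and `LoadInBatches` method are hypothetical names used only for illustration, and the simple `GroupBy`-based batching stands in for the partition code mentioned above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyBatching
{
    // Hypothetical helper: enumerating the factories is cheap because each
    // delegate only describes the work; the expensive part runs when the
    // delegate is invoked, which happens on PLINQ worker threads.
    public static T[][] LoadInBatches<T>(IEnumerable<Func<T>> factories, int size)
    {
        if (factories == null) throw new ArgumentNullException(nameof(factories));
        if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));

        // Invoke every factory in parallel, keeping the original order.
        T[] items = factories
            .AsParallel()
            .AsOrdered()
            .Select(factory => factory())
            .ToArray();

        // Group the materialized items into consecutive batches of `size`
        // (the last batch may be smaller).
        return items
            .Select((item, index) => new { item, index })
            .GroupBy(pair => pair.index / size)
            .Select(group => group.Select(pair => pair.item).ToArray())
            .ToArray();
    }
}
```

On the caller's side, a dataset could then be passed as something like `paths.Select(path => new Func<(float[], float[])>(() => LoadSample(path)))`, where `paths` and `LoadSample` are placeholders for whatever the user already has.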

My question is: is this the only possible solution, or is there another way to enumerate a "locking" IEnumerable in parallel, so that the slow generation of each item doesn't force the items to be yielded one at a time?

Thanks!

Edit: to clarify, I am not writing the actual code in the first IEnumerable. That's up to the user of the library in question, who will pass in their own IEnumerable for the library to split into batches and work on. One of the reasons I was hoping there'd be an alternative to a Func delegate is that, on the user side, just returning a tuple would be easier and more intuitive than having to explicitly return a function that lazily computes the whole thing.

seems like you need to improve the ienumerable, not your code. — Daniel A. White, Dec 05 '17 at 19:49
@DanielA.White The `IEnumerable` is provided by the user, I'm not the one writing it. It's basically the way the user would input its training dataset to the library, and here I'm only transforming it into batches to use. I'm only assuming it would generally involve some expensive work to actually spawn the various training samples on the user side (reading a file, etc..). — Sergio0694, Dec 05 '17 at 19:52
There is no way, because IEnumerable is sequential by nature. You cannot somehow force a bunch of sequential code to execute in parallel (without modifying it of course). — Evk, Dec 05 '17 at 20:19
@Evk yeah I was expecting that. The thing is that requiring the users to manually wrap all their code for each sample in a lambda seems a bit clunky and less intuitive to use than I'd like, is there at least a better way to do that on the user side? Especially because you usually can't use the implicit `var` when using functions like that, as the compiler usually can't determine the exact type to cast a lambda to (a `Func(float[,] etc..` in this case). Thanks! — Sergio0694, Dec 05 '17 at 20:23
To the user that downvoted the question, "too broad"? Really? I've provided a description of the context, a code sample and explained in detail what I'm trying to do here and what I got so far. How is that considered too broad of a question? — Sergio0694, Dec 06 '17 at 08:53
You could take a look at [this](https://stackoverflow.com/questions/62035864/design-help-for-parallel-processing-azure-blob-and-bulk-copy-to-sql-database-c/62041200#62041200 "Design help for parallel processing Azure blob and bulk copy to SQL database") answer, that studies the effects of using PLINQ for creating processing pipelines. It is tricky. — Theodor Zoulias, Feb 05 '23 at 15:54

score 0 · Accepted Answer · answered Dec 06 '17 at 09:01

I'm afraid you can't. If that initial IEnumerable is slow, there is nothing you can do in a second step to make it faster, no matter how much parallelization and processing power you throw at it. In the best case, the later steps add as little overhead as possible, but the enumeration itself is still slow.

The solution would be to see whether the original sequence itself can be sped up by any means.
