I would appreciate any help from the PLINQ experts out there! I will take time to review answers; I have a more established profile on math.SE.

I have an object of type ParallelQuery<List<string>> containing 44 lists, which I would like to process in parallel (five at a time, say). My process has a signature like

private ProcessResult Process(List<string> input)

The processing will return a result, which is a pair of Boolean values, as below.

    private struct ProcessResult
    {
        public ProcessResult(bool initialised, bool successful)
        {
            ProcessInitialised = initialised;
            ProcessSuccessful = successful;
        }

        public bool ProcessInitialised { get; }
        public bool ProcessSuccessful { get; }
    }

The problem. Given an IEnumerable<List<string>> processMe, my PLINQ query tries to use this overload: https://msdn.microsoft.com/en-us/library/dd384151(v=vs.110).aspx. It is written as

processMe.AsParallel()
         .Aggregate<List<string>, ConcurrentStack<ProcessResult>, ProcessResult>
             (
                 new ConcurrentStack<ProcessResult>(),   //aggregator seed
                 (agg, input) =>
                 {                         //updating the aggregate result
                     var res = Process(input);
                     agg.Push(res);
                     return agg;
                 },
                 agg => 
                 {                         //obtain the result from the aggregator agg
                     ProcessResult res;    // (in this case just the most recent result**)
                     agg.TryPop(out res);
                     return res;
                 }
             );

Unfortunately it does not run in parallel, only sequentially. (** note that this implementation doesn't make "sense", I am just trying to get the parallelisation to work for now.)


I tried a slightly different implementation, which did run in parallel, but there was no aggregation. I defined an aggregation method (which is essentially a Boolean AND on both parts of ProcessResult, i.e. aggregate([A1, A2], [B1, B2]) ≡ [A1 && B1, A2 && B2]).

private static ProcessResult AggregateProcessResults
        (ProcessResult aggregate, ProcessResult latest)
    {
        bool ini = false, suc = false;
        if (aggregate.ProcessInitialised && latest.ProcessInitialised)
            ini = true;
        if (aggregate.ProcessSuccessful && latest.ProcessSuccessful)
            suc = true;


        return new ProcessResult(ini, suc);
    }

I then used this PLINQ overload: https://msdn.microsoft.com/en-us/library/dd383667(v=vs.110).aspx

.Aggregate<List<string>, ProcessResult, ProcessResult>(
    new ProcessResult(true, true),
    (res, input)  => Process(input),
    (agg, latest) => AggregateProcessResults(agg, latest),
    agg           => agg
);

The problem here was that the AggregateProcessResults code was never hit, for some reason—I am clueless where the results were going...

Thanks for reading, any help appreciated :)

Szmagpie
  • If you want to compute a new value for each item in the sequence you should use `Select`, not `Aggregate`. When you use the correct operation for the work that you're trying to do you'll find the system will be able to accomplish it much more effectively. – Servy Nov 21 '17 at 19:03
  • How many items do you have in your collection? (Only 44?) How many CPU cores do you have? Running a query on multiple threads and multiple CPU cores requires complex preparation: the collection has to be split into as many parts as there are CPU cores available, the tasks run on threads, and the results finally aggregated. So .NET is smart enough not to do a lot of work that would make everything much slower... – Major Nov 21 '17 at 19:18
  • @Major I have 22000 strings, which are batched into 500s, giving 44 lists. I am limited to running five processes simultaneously – Szmagpie Nov 21 '17 at 20:10
  • @Servy I think I understand, but sorry if not: I can use `Select` to get a `ParallelQuery`. Although I'd still need to aggregate this afterwards, to get a single `ProcessResult`. The aggregation doesn't need to be in parallel, but surely the point of `ParallelQuery.Aggregate` is to avoid splitting this into two steps? :l – Szmagpie Nov 21 '17 at 20:25
  • @Szmagpie You're not actually aggregating your code. You're just processing each value and collecting all of the results, with no aggregation. Since you're not actually aggregating things, you shouldn't use `Aggregate`. If you were actually aggregating items, then it might make sense to use `Aggregate`. – Servy Nov 21 '17 at 20:28
  • Although even then, if you wanted to do some expensive computation to turn each value into a different value, and then aggregate those, you should *still* use `Select` to map the values and then `Aggregate` to aggregate them, rather than trying to do it all in `Aggregate`. But as it is, you are mapping the values and then not even aggregating them, so there's no reason to use `Aggregate` at all. – Servy Nov 21 '17 at 20:29
  • @Servy ahh okay, thanks, I will try to split it like that: `.AsParallel().Select(..).Aggregate(..)`. You're right that I do not currently aggregate anything, but once the parallelisation is working, I plan to implement that. See the last part of my question; the logic will be something like aggregate([A1, A2], [B1, B2]) ≡ [A1 && B1, A2 && B2]. The ConcurrentStack is just a placeholder while I work on it. – Szmagpie Nov 21 '17 at 20:41

1 Answer

The overload of Aggregate you are using will indeed not run in parallel, by design. You pass a seed and then a step function, but the step function's first argument (agg) is the accumulator returned by the previous step. That makes it inherently sequential (the result of one step is the input to the next) and therefore not parallelizable. I'm not sure why this overload is included in ParallelEnumerable, but there was probably a reason.

Instead, use another overload:

var result = processMe
.AsParallel()
.Aggregate
(
    // seed factory. Each partition will call this to get its own seed
    () => new ConcurrentStack<ProcessResult>(),
    // process element and update accumulator
    (agg, input) =>
    {                                           
        var res = Process(input);
        agg.Push(res);
        return agg;
    },
    // combine accumulators from different partitions
    (agg1, agg2) => {
        agg1.PushRange(agg2.ToArray());
        return agg1;
    },
    // reduce
    agg =>
    {
        ProcessResult res;
        agg.TryPop(out res);
        return res;
    }
);
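
Once the placeholder stack is no longer needed, the same seed-factory overload can carry the Boolean AND aggregation described in the question directly. The following is only a minimal sketch under a few assumptions: the body of `Process` is a hypothetical stand-in (the real one is not shown in the question), the `And` helper and the `AggregateSketch` class are names invented for illustration, the test data is made up to match the 44 batches of 500 strings mentioned in the comments, and `WithDegreeOfParallelism(5)` is added on the assumption that "five at a time" is meant as an upper bound on concurrent tasks.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AggregateSketch
{
    // Same shape as the ProcessResult struct in the question.
    public struct ProcessResult
    {
        public ProcessResult(bool initialised, bool successful)
        {
            ProcessInitialised = initialised;
            ProcessSuccessful = successful;
        }

        public bool ProcessInitialised { get; }
        public bool ProcessSuccessful { get; }
    }

    // Hypothetical stand-in for the real Process method (an assumption):
    // a batch "succeeds" when none of its strings is empty.
    public static ProcessResult Process(List<string> input)
    {
        return new ProcessResult(true, input.All(s => s.Length > 0));
    }

    // Component-wise AND, i.e. aggregate([A1, A2], [B1, B2]) = [A1 && B1, A2 && B2].
    public static ProcessResult And(ProcessResult a, ProcessResult b)
    {
        return new ProcessResult(
            a.ProcessInitialised && b.ProcessInitialised,
            a.ProcessSuccessful && b.ProcessSuccessful);
    }

    public static ProcessResult Run()
    {
        // Made-up data: 44 batches of 500 non-empty strings.
        List<List<string>> processMe = Enumerable.Range(0, 44)
            .Select(b => Enumerable.Range(0, 500)
                                   .Select(i => "item-" + b + "-" + i)
                                   .ToList())
            .ToList();

        return processMe
            .AsParallel()
            .WithDegreeOfParallelism(5)   // at most five concurrently executing tasks
            .Aggregate(
                // seed factory: each partition starts from the AND identity
                () => new ProcessResult(true, true),
                // process one batch and fold it into the partition's accumulator
                (agg, input) => And(agg, Process(input)),
                // combine accumulators from different partitions
                (agg1, agg2) => And(agg1, agg2),
                // final result selector
                agg => agg);
    }

    public static void Main()
    {
        ProcessResult result = Run();
        Console.WriteLine(result.ProcessInitialised + " " + result.ProcessSuccessful);
    }
}
```

Because the accumulator is an immutable struct, each partition works on its own copy and no concurrent collection is required. Note that the update function has to fold `Process(input)` into `agg`; returning `Process(input)` on its own would discard everything accumulated earlier in that partition.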
Evk