
I have a pretty simple task to accomplish in C#. The user gives me a stream representing a large binary file (tens of GB in size). The file consists of many distinct blocks. I need to read each block, perform some CPU-intensive analysis on it, and then give the user the results, in the same order as the blocks appear in the file. In pseudocode, it might look like this:

public IEnumerable<TResult> ReadFile(Stream inputStream) {
    while(true) {
        byte[] block = ReadNextBlock(inputStream);
        if (block == null) {
            break; // EOF
        }
        TResult result = PerformCpuIntensiveAnalysis(block);
        yield return result;
    }
}

This works correctly, but slowly, since it's only using one CPU core for the CPU-intensive analysis. What I'd like to do is read the blocks one by one, analyze them in parallel, then return the results to the user in the same order as the blocks were encountered in the file. Naturally, I can't read the entire file into memory, so I'd like to limit the number of blocks that I keep in the queue at any given time.
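Concretely, the shape I'm after is something like the following rough sketch (not code I'm benchmarking; ReadNextBlock, PerformCpuIntensiveAnalysis and TResult are the same placeholders as above, ReadFileParallel is just a name for this sketch, and the capacity of 16 is arbitrary):

public IEnumerable<TResult> ReadFileParallel(Stream inputStream) {
    // Bounded queue of in-flight analyses (needs System.Collections.Concurrent
    // and System.Threading.Tasks).
    var pending = new BlockingCollection<Task<TResult>>(16);

    var reader = Task.Run(() => {
        try {
            while (true) {
                byte[] block = ReadNextBlock(inputStream);
                if (block == null) {
                    break; // EOF
                }
                // Add() blocks once 16 items are queued, which caps how many
                // blocks are held in memory at any given time.
                pending.Add(Task.Run(() => PerformCpuIntensiveAnalysis(block)));
            }
        } finally {
            pending.CompleteAdding();
        }
    });

    // Tasks are dequeued in the order they were queued, so the results come
    // back in file order.
    foreach (var task in pending.GetConsumingEnumerable()) {
        yield return task.Result;
    }

    reader.Wait();
}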

There are lots of solutions to this, and I've tried a couple; but, for some reason, I can't find any solution that significantly outperforms the naive approach:

public IEnumerable<TResult> ReadFile(Stream inputStream) {
    while(true) {
        var batch = new List<byte[]>();
        for (int i = 0; i < BATCH_SIZE; i++) {
            byte[] block = ReadNextBlock(inputStream);
            if (block == null) {
                break;
            }
            batch.Add(block);
        }
        if (batch.Count == 0) {
            break;
        }
        foreach (var result in batch
            .AsParallel()
            .AsOrdered()
            .Select(block => PerformCpuIntensiveAnalysis(block))
            .ToList()) {
            yield return result;
        }
    }
}

I've tried TPL Dataflow as well as the purely manual approach, and in every case, my code spends most of its time waiting for synchronization. It does outperform the serial version by about 2x, but on a machine with 8 cores, I'd expect more than that. So, what am I doing wrong?

(I should also clarify that I'm not really using the "yield return" generator pattern in my code; I'm just using it here for brevity.)
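For reference, the kind of Dataflow pipeline I mean looks roughly like the sketch below (the capacity and degree of parallelism are example values only, not necessarily what I benchmarked, and ReadFileDataflow is just a name for the sketch):

public IEnumerable<TResult> ReadFileDataflow(Stream inputStream) {
    // Requires the TPL Dataflow package (System.Threading.Tasks.Dataflow).
    var analyzer = new TransformBlock<byte[], TResult>(
        block => PerformCpuIntensiveAnalysis(block),
        new ExecutionDataflowBlockOptions {
            MaxDegreeOfParallelism = Environment.ProcessorCount,
            BoundedCapacity = 16 // throttles the reader so the file is never fully buffered
        });

    var reader = Task.Run(async () => {
        while (true) {
            byte[] block = ReadNextBlock(inputStream);
            if (block == null) {
                break; // EOF
            }
            await analyzer.SendAsync(block); // waits while the block is at capacity
        }
        analyzer.Complete();
    });

    // A TransformBlock emits results in input order by default, so file order
    // is preserved.
    while (analyzer.OutputAvailableAsync().Result) {
        yield return analyzer.Receive();
    }

    reader.Wait();
}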

  • Without a good [mcve] that reliably reproduces the problem, it's impossible to know for sure. But you certainly should consider the possibility that your bottleneck is I/O and not CPU. If I/O is your bottleneck, you can add threads and CPU cores 'til the cows come home and it still won't do a lick of good. – Peter Duniho Aug 16 '16 at 23:29
  • @Peter Duniho: What other information do you need for the example? In terms of I/O vs. CPU, I've tried loading a smaller file (~100 MB) into a memory buffer, then wrapping it as a MemoryStream for testing -- so I'm reasonably sure I/O is not the bottleneck. – Bugmaster Aug 16 '16 at 23:36
  • Have you tried using a profiler (eg: PerfView) to try to identify where the bottleneck is occurring? – easuter Aug 16 '16 at 23:53
  • @easuter: Yes, I tried the ANTS profiler as well as the VS Concurrency Visualizer. Both of them are telling me roughly the same thing: my code spends most of its time on waiting for synchronization, and not enough time on actually doing useful work. Clearly, I'm doing something wrong, but I'm not sure what... – Bugmaster Aug 16 '16 at 23:59
  • Couldn't you write the results to a temporary buffer instead of directly returning them? This would probably remove the need for synchronization. Then you would not need the `AsOrdered()` either. – Nico Schertler Aug 17 '16 at 00:09
  • @Nico Schertler: Sorry, can you clarify? At the end of the day, I need to return the results to the user in the correct order; I also don't want to spend too much extra RAM, though obviously a small amount would be OK. – Bugmaster Aug 19 '16 at 00:19
  • You already have the batch list. Either create another list for the results or integrate the results in this list. Then use a `for` loop to process the items in the batch in parallel. After processing, return the results in the correct order. – Nico Schertler Aug 19 '16 at 01:45
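In code, the approach suggested in the last comment might look roughly like this (the `Parallel.For` loop and the results array are one reading of the suggestion, not code from the thread, and ProcessBatch is just a name for the sketch):

public IEnumerable<TResult> ProcessBatch(List<byte[]> batch) {
    var results = new TResult[batch.Count];

    // Each iteration writes only its own slot, so no locking is needed.
    Parallel.For(0, batch.Count, i => {
        results[i] = PerformCpuIntensiveAnalysis(batch[i]);
    });

    // The batch preserves file order, so yielding the slots in index order
    // returns the results in the correct order.
    foreach (var result in results) {
        yield return result;
    }
}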

1 Answer


Try to optimize the block size.

If there are too few blocks and one of them takes much longer than the others, then only one CPU will have to do almost all the work.

On the other hand, if the blocks are too small, TPL will spend a lot of time with overhead related to task management.

You should have significantly more blocks than CPUs. This allows TPL to distribute the work evenly across the CPUs. On the other hand, each block should require a significant amount of computation. It is hard to give concrete numbers, so you will have to experiment.
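As an illustration of the trade-off (CHUNK_SIZE is a tuning knob, not a recommended value, and ProcessBatchChunked is just a name for the sketch), several small file blocks can be grouped into one work item so that each parallel task carries enough computation, while a batch still contains many more work items than cores:

const int CHUNK_SIZE = 8; // file blocks per parallel work item; tune experimentally

public IEnumerable<TResult> ProcessBatchChunked(List<byte[]> batch) {
    // Group contiguous blocks into chunks so each work item amortizes the
    // scheduling overhead over several blocks.
    var chunks = new List<List<byte[]>>();
    for (int i = 0; i < batch.Count; i += CHUNK_SIZE) {
        chunks.Add(batch.GetRange(i, Math.Min(CHUNK_SIZE, batch.Count - i)));
    }

    // AsOrdered keeps the chunks, and therefore the results, in file order.
    return chunks
        .AsParallel()
        .AsOrdered()
        .SelectMany(chunk => chunk.Select(block => PerformCpuIntensiveAnalysis(block)))
        .ToList();
}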

  • Can you recommend a good way of dynamically scaling the buffer size? I have no control over the size of the blocks; or rather, I know that each block will be about 64K in size, but I don't know how long it will take to process. But I could batch up the blocks dynamically... – Bugmaster Aug 19 '16 at 00:20
  • I would just experiment: try a block size that is 4 times smaller or larger, and see whether it gets better or worse. – Olivier Jacot-Descombes Aug 19 '16 at 12:49
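A minimal sketch of such an experiment, assuming the in-memory test stream mentioned in the question's comments and a hypothetical ProcessFileWithChunkSize wrapper that runs the pipeline with a given chunk size (the candidate sizes are arbitrary):

foreach (int chunkSize in new[] { 2, 8, 32, 128 }) {
    testStream.Position = 0; // rewind the in-memory test stream for each run
    var stopwatch = System.Diagnostics.Stopwatch.StartNew();
    foreach (var result in ProcessFileWithChunkSize(testStream, chunkSize)) {
        // drain the results so the whole pipeline actually runs
    }
    Console.WriteLine("chunk size " + chunkSize + ": " + stopwatch.ElapsedMilliseconds + " ms");
}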