I have a pretty simple task to accomplish in C#. The user gives me a stream representing a large binary file (tens of GB in size). The file consists of lots and lots of distinct blocks. I need to read each block, perform some CPU-intensive analysis on it, and then give the user the results, in the correct order. In pseudocode, it might look like this:
public IEnumerable<TResult> ReadFile(Stream inputStream) {
    while (true) {
        byte[] block = ReadNextBlock(inputStream);
        if (block == null) {
            break; // EOF
        }
        TResult result = PerformCpuIntensiveAnalysis(block);
        yield return result;
    }
}
This works correctly, but slowly, since it only uses one CPU core for the CPU-intensive analysis. What I'd like to do is read the blocks one by one, analyze them in parallel, and then return the results to the user in the same order in which the blocks appear in the file. Naturally, I can't read the entire file into memory, so I'd like to limit the number of blocks I keep in the queue at any given time.
There are lots of possible solutions to this, and I've tried a couple, but for some reason I can't find one that significantly outperforms this naive batched approach:
public IEnumerable<TResult> ReadFile(Stream inputStream) {
    while (true) {
        // Read the next batch of blocks off the stream.
        var batch = new List<byte[]>();
        for (int i = 0; i < BATCH_SIZE; i++) {
            byte[] block = ReadNextBlock(inputStream);
            if (block == null) {
                break; // EOF
            }
            batch.Add(block);
        }
        if (batch.Count == 0) {
            break;
        }
        // Analyze the batch in parallel, preserving block order.
        foreach (var result in batch
            .AsParallel()
            .AsOrdered()
            .Select(block => PerformCpuIntensiveAnalysis(block))
            .ToList()) {
            yield return result;
        }
    }
}
I've tried TPL Dataflow as well as a purely manual approach, and in every case, my code spends most of its time waiting on synchronization. It does outperform the serial version by about 2x, but on a machine with 8 cores I'd expect much more than that. So, what am I doing wrong?
(I should also clarify that I'm not actually using the "yield return" generator pattern in my real code; I'm just using it here for brevity.)
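For reference, here is roughly what my two attempts looked like, heavily simplified (QUEUE_CAPACITY is a placeholder for the in-flight block limit). The Dataflow version used a single TransformBlock with a bounded capacity; it needs the System.Threading.Tasks.Dataflow package:

public IEnumerable<TResult> ReadFileDataflow(Stream inputStream) {
    // A TransformBlock runs the analysis on multiple threads but emits
    // results in input order by default; BoundedCapacity caps memory use.
    var analysis = new TransformBlock<byte[], TResult>(
        block => PerformCpuIntensiveAnalysis(block),
        new ExecutionDataflowBlockOptions {
            MaxDegreeOfParallelism = Environment.ProcessorCount,
            BoundedCapacity = QUEUE_CAPACITY
        });

    // Producer: SendAsync applies backpressure when the block is full.
    var producer = Task.Run(async () => {
        byte[] block;
        while ((block = ReadNextBlock(inputStream)) != null) {
            await analysis.SendAsync(block);
        }
        analysis.Complete();
    });

    // Consumer: drain results in order until the pipeline completes.
    while (analysis.OutputAvailableAsync().Result) {
        TResult result;
        while (analysis.TryReceive(out result)) {
            yield return result;
        }
    }
    producer.Wait();
}

And the manual version was essentially a sliding window of tasks:

public IEnumerable<TResult> ReadFileManual(Stream inputStream) {
    // Sliding window of in-flight analysis tasks; the Queue preserves
    // file order, so results come out in the order blocks were read.
    var pending = new Queue<Task<TResult>>();
    while (true) {
        byte[] block = ReadNextBlock(inputStream);
        if (block != null) {
            pending.Enqueue(Task.Run(() => PerformCpuIntensiveAnalysis(block)));
        }
        // Yield completed results once the window is full, or drain at EOF.
        while (pending.Count > 0 &&
               (block == null || pending.Count >= QUEUE_CAPACITY)) {
            yield return pending.Dequeue().Result; // blocks until task finishes
        }
        if (block == null) {
            break; // EOF
        }
    }
}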