So you're only running about half as fast as you hoped, when scaling up to 29 parallel copies of your code?
Memory bandwidth could be an issue, with 29 copies of the same algorithm reading/writing their own memory at the same time. That's why, in a case like this, it could potentially be better (but much harder) to look for parallelism within a single iteration, so the cores share one working set instead of competing for memory with 29 separate ones.
Let's use video encoding as a specific example of what "one iteration" might be: encoding 29 videos in parallel is like what you're proposing. Having x264 use 32 cores to encode one video, then repeating that for the next 28 vids, uses much less total RAM and caches better.
In practice, maybe 2 or 3 vids in parallel, each using 10 to 16 threads, would be good, since there's a limit to how much parallelism x264 can find.
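As a rough sketch of that middle ground (assuming GNU parallel is installed and your sources are .y4m files; the job count, thread count, and filenames are made-up placeholders):

```bash
# Run at most 3 encodes at once; each x264 instance gets 12 threads.
# *.y4m and the output naming are placeholders for your real files.
parallel -j3 'x264 --threads 12 -o {.}.264 {}' ::: *.y4m
```

The same idea applies to any tool that can use a bounded number of threads per job: keep the total thread count near your core count instead of oversubscribing.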
It depends on the algorithm, and how well it scales with multiple threads. If it doesn't scale at all, or you don't have time to code it, then brute force all the way: a speedup factor of over 10 is nothing to sneeze at for basically no effort (e.g. running a single-threaded program on different data sets with make -j29 or GNU parallel, or in your case using multiple threads in a single program). :)
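For instance, the GNU parallel version of that brute-force approach might look like this (./crunch, data/, and out/ are hypothetical stand-ins for your single-threaded program and files):

```bash
# Launch up to 29 jobs at a time; each runs the single-threaded program
# on one input file and writes its own output file.
mkdir -p out
parallel -j29 './crunch {} > out/{/.}.txt' ::: data/*.dat
```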
While your code is running, you could check CPU utilization to make sure you're actually keeping 29 CPU cores busy like you're trying to. You could also use a profiling tool (like Linux perf) to investigate cache effects: if a parallel run has a lot more than 29 times the data-cache misses of a single-threaded run, that would start to explain things.
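Something along these lines would do it (a sketch; ./myprog and the run script are placeholders for however you actually launch things, and mpstat comes from the sysstat package):

```bash
# Per-core utilization, refreshed every second, while the parallel run is going:
mpstat -P ALL 1

# Compare cache misses for a single-threaded run vs. the full parallel run:
perf stat -e cache-references,cache-misses ./myprog one_dataset.dat
perf stat -e cache-references,cache-misses ./run_all_29_copies.sh
```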