
I'm trying to build a variation of a `HashMap` that can be extended very quickly using multiple threads. I partition the data across submaps by key remainder. It works, but the speedup over my sequential version is surprisingly small. Here's my code:

use rustc_hash::FxHashMap;
use rayon::prelude::*;
use std::time::Instant;


fn main() {
    const NUM_SUBMAPS: usize = 1_000;

    // initialize data for the serial version, routing each pair to a submap by key remainder
    let mut data_vecs = vec![Vec::new(); NUM_SUBMAPS];
    for i in 0..100_000_000 {
        data_vecs[i % NUM_SUBMAPS].push((i, i));
    }
    let mut maps = vec![FxHashMap::default(); NUM_SUBMAPS];

    // initialize clones for parallel version
    let (data_vecs_clone, mut maps_clone) = (data_vecs.clone(), maps.clone());


    // time sequential version
    let t = Instant::now();
    maps.iter_mut().zip(data_vecs).for_each(|(submap, vec)| {
        submap.extend(vec);
    });
    println!("time in sequential version: {}", t.elapsed().as_secs_f64());
    drop(maps); // free the sequential maps before timing the parallel run


    // time parallel version
    let t = Instant::now();
    maps_clone.par_iter_mut().zip(data_vecs_clone).for_each(|(submap, vec)| {
        submap.extend(vec);
    });
    println!("time in parallel version: {}", t.elapsed().as_secs_f64());
}

And here's the output on my machine:

time in sequential version: 1.9712106999999999
time in parallel version: 0.7583539

The parallel version is faster, but the speedup is much smaller than I typically get with Rayon. On my 16-core Ryzen 9 5950X, I usually see speedups of over 10x. Why is the speedup so much smaller in this case? Is there any way to make the parallel version use all of the CPU's cores efficiently?

Edit:

I'm on Windows, in case that makes a difference.

Daniel Giger
  • On my fairly old 4-core mac, I get 3.55 sec and 0.79 sec, resp. – Peter Hall Jul 05 '21 at 01:35
  • The test appears to be quite allocation-heavy, so perhaps you could try a different allocator to see if that changes anything. I think jemalloc is pretty efficient in a multithreaded context, but it isn't the default for every arch. (See the allocator sketch after this comment thread.) – Peter Hall Jul 05 '21 at 01:43
  • @PeterHall Is using jemalloc possible on Windows? I tried adding `jemallocator = "0.3.2"` to my `Cargo.toml`, but then it wouldn't compile. It had this error: `failed to run custom build command for jemalloc-sys v0.3.2` – Daniel Giger Jul 05 '21 at 02:11
  • Sorry, I have no knowledge about Windows! – Peter Hall Jul 05 '21 at 02:14
  • Stupid question, but are you building with the optimizations turned on? Also, re: jemalloc what target are you building for? Jemallocator doesn't build for all windows targets. – Aiden4 Jul 05 '21 at 02:50
  • @Aiden4 Yes, I compiled with `--release`. For build target, I'm not quite sure what you're asking, but I guess I'm building for whatever the default target is? – Daniel Giger Jul 05 '21 at 03:06
  • To get the build target, run `rustc -vV`. The default target triple is labeled `host`. If the fourth part is `msvc` rather than `gnu`, then that target is unsupported by jemalloc. – Aiden4 Jul 05 '21 at 03:41
  • Then apparently, I was using the `msvc` target. I just switched to the `gnu` target, and it's still giving the same error. – Daniel Giger Jul 05 '21 at 03:56
  • What was the full error message from the build script? Also, make sure that you actually switched targets instead of just installing the new one. – Aiden4 Jul 05 '21 at 04:32
  • I added the error to the end of the question. – Daniel Giger Jul 05 '21 at 04:51
  • With 1000 submaps across 16 cores (up to 32 threads), each thread will have to work on *at least* 31 different submaps. They do this one submap at a time, and changing to the next submap will result in a bunch of cache misses that will slow things down until the caches warm up again. Optimal performance would be attained when the number of submaps equals the number of threads (and when the number of threads does not exceed the available parallelism from the OS+hardware). (See the sketch after this comment thread.) – eggyal Jul 05 '21 at 09:58
  • When I tried using 32 submaps (the number of threads Rayon is using), both versions ran slower (3.73 sec and 2.28 sec respectively). I'm guessing the larger maps might be too big to fit in cache? – Daniel Giger Jul 05 '21 at 18:35
  • The answer to questions like this is usually that you're running out of memory bandwidth before you run out of cores. – Matt Timmermans Jul 05 '21 at 23:32
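
Following up on Peter Hall's allocator suggestion: swapping the global allocator in Rust only takes a `#[global_allocator]` declaration. Since `jemalloc-sys` doesn't build on `msvc` targets, the sketch below uses the `mimalloc` crate instead; that choice is my own assumption (an allocator that does build on Windows), not something anyone in this thread has benchmarked.

// Cargo.toml: mimalloc = "0.1" (hypothetical dependency line)
use mimalloc::MiMalloc;

// Route every heap allocation through mimalloc instead of the system
// allocator; no other code changes are required.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;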

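For reference, here is a minimal sketch of eggyal's one-submap-per-thread layout, sized with `rayon::current_num_threads()`. Pre-reserving each map's capacity is my own addition (an assumption, untested in this thread) to cut the grow-and-rehash traffic that Matt Timmermans' memory-bandwidth point suggests may dominate; note that Daniel reported 32 submaps actually running slower on his machine.

use rayon::prelude::*;
use rustc_hash::FxHashMap;

fn main() {
    const N: usize = 100_000_000;
    // One submap per Rayon worker thread, as eggyal suggests.
    let num_submaps = rayon::current_num_threads();

    // Route each pair to a submap by key remainder, as in the question.
    let mut data_vecs: Vec<Vec<(usize, usize)>> = vec![Vec::new(); num_submaps];
    for i in 0..N {
        data_vecs[i % num_submaps].push((i, i));
    }

    // Pre-size each map so `extend` never has to grow and rehash.
    let mut maps: Vec<FxHashMap<usize, usize>> = (0..num_submaps)
        .map(|_| FxHashMap::with_capacity_and_hasher(N / num_submaps + 1, Default::default()))
        .collect();

    maps.par_iter_mut().zip(data_vecs).for_each(|(submap, vec)| {
        submap.extend(vec);
    });
}
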
0 Answers