I'm trying to make a variation of a HashMap that can be extended very quickly using multithreading. I'm partitioning the data into submaps by key remainder, so a key k goes into submap k % NUM_SUBMAPS. It works, but the speedup compared to my sequential version is surprisingly small.
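To illustrate the idea, a lookup would route through the same remainder that the build step uses; the helper below is just a sketch of that and isn't part of the benchmark:

use rustc_hash::FxHashMap;

// Illustrative sketch only: pick the submap by the key's remainder,
// then do an ordinary HashMap lookup inside it.
fn lookup(maps: &[FxHashMap<usize, usize>], key: usize) -> Option<&usize> {
    maps[key % maps.len()].get(&key)
}

Here's my code: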
use rustc_hash::FxHashMap;
use rayon::prelude::*;
use std::time::Instant;
fn main() {
    const NUM_SUBMAPS: usize = 1_000;
    // initialize data for serial version
    let mut data_vecs = vec![Vec::new(); NUM_SUBMAPS];
    for i in 0..100_000_000 {
        data_vecs[i % NUM_SUBMAPS].push((i, i));
    }
    let mut maps = vec![FxHashMap::default(); NUM_SUBMAPS];
    // initialize clones for parallel version
    let (data_vecs_clone, mut maps_clone) = (data_vecs.clone(), maps.clone());
    // time sequential version
    let t = Instant::now();
    maps.iter_mut().zip(data_vecs).for_each(|(submap, vec)| {
        submap.extend(vec);
    });
    println!("time in sequential version: {}", t.elapsed().as_secs_f64());
    drop(maps);
    // time parallel version
    let t = Instant::now();
    maps_clone.par_iter_mut().zip(data_vecs_clone).for_each(|(submap, vec)| {
        submap.extend(vec);
    });
    println!("time in parallel version: {}", t.elapsed().as_secs_f64());
}
And here's the output on my machine:
time in sequential version: 1.9712106999999999
time in parallel version: 0.7583539
The parallel version is faster, but only by about 2.6x. I'm on a 16-core Ryzen 9 5950X, and with Rayon I typically see speedups of over 10x. Why is the speedup so much smaller in this case? Is there any way to make the parallel version use all of the CPU's cores efficiently?
Edit:
I'm on Windows, in case that makes a difference.