I'm trying to parallelize a portion of my code. Although it uses rayon's par_iter() and par_extend(), it still appears to run on a single thread.

I simply create a vector of i64, fill it with a lot of values, and then insert values computed from them into a std::collections::HashSet of integers.

My single threaded code:

use std::collections::HashSet;

fn main() {
    let my_vec: Vec<i64> = (0..100_000_000).collect();

    let mut my_set: HashSet<i64> = HashSet::new();
    let st = std::time::Instant::now();
    my_set.extend(
        my_vec.iter().map(|x| x*(x+3)/77+44741)  // this is supposed to take a while to compute
    );
    let dur = st.elapsed();
    println!("{:?}", dur);

}

Running time is around 8.86 s on average. Here is the code using parallel iterators:

extern crate rayon;
use rayon::prelude::*;
use std::collections::HashSet;

fn main() {
    let my_vec: Vec<i64> = (0..100_000_000).collect();

    let mut my_set: HashSet<i64> = HashSet::new();
    let st = std::time::Instant::now();
    my_set.par_extend(
        my_vec.par_iter().map(|x| x*(x+3)/77+44741) // this is supposed to take a while to compute
    );
    let dur = st.elapsed();
    println!("{:?}", dur);
}

The average running time for the 'parallel' version is almost the same (8.62 s), and the CPU monitor clearly shows that a single core is working at 100% while the others just wait.

Do you know what I did wrong, or did not understand?

Boiethios
m.raynal
  • @FrenchBoiethios Thanks for the remark, it's edited, it looks more 'rusthonic' now. – m.raynal Jul 12 '19 at 13:33
  • I think that it works well: https://play.rust-lang.org/?version=stable&mode=release&edition=2018&gist=53958a37fb57365a9609c210aede3706 Rayon is slower, which is an expected result – Boiethios Jul 12 '19 at 13:51
  • I think that the CPU that runs at 100% is the rayon runtime. Since the calculation is faster than a context switch, the other threads have a low load. – Boiethios Jul 12 '19 at 13:55
  • Ok, that would be a good sign, since the objects I work with in my project are much bigger, and the expression `|x| x*(x+3)/77+44741` is actually a big search function which takes several dozen milliseconds. I'll try on my project then, thanks for the insight :) – m.raynal Jul 12 '19 at 13:56

1 Answer

Your simulation is not representative because the calculation is very fast: each item takes far less time than a thread context switch, by several orders of magnitude. The core at 100% is likely the rayon runtime, while the other cores wait for it.
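
To make the granularity point concrete, here is a minimal sketch that uses only the standard library. It is not what rayon does internally; `cheap` and `chunked_set` are hypothetical names introduced for illustration. Each thread receives one large chunk, so the thread start-up cost is paid once per chunk rather than once per element, and the final merge into the HashSet still happens on a single thread.

```rust
use std::collections::HashSet;
use std::thread;

// The cheap per-item computation from the question.
fn cheap(x: i64) -> i64 {
    x * (x + 3) / 77 + 44741
}

// Hypothetical helper: split `values` into at most `n_threads` chunks,
// map each chunk on its own scoped thread, then merge the partial
// results into one set on the calling thread.
fn chunked_set(values: &[i64], n_threads: usize) -> HashSet<i64> {
    let n_threads = n_threads.max(1);
    // Ceiling division; `.max(1)` keeps `chunks` valid for an empty slice.
    let chunk_len = ((values.len() + n_threads - 1) / n_threads).max(1);

    let partials: Vec<Vec<i64>> = thread::scope(|s| {
        // Spawn every worker first, then join them all.
        let handles: Vec<_> = values
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().map(|&x| cheap(x)).collect::<Vec<i64>>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // The merge is sequential: only the mapping step ran in parallel.
    let mut set = HashSet::new();
    for part in partials {
        set.extend(part);
    }
    set
}

fn main() {
    let values: Vec<i64> = (0..1_000_000).collect();
    let sequential: HashSet<i64> = values.iter().map(|&x| cheap(x)).collect();
    let parallel = chunked_set(&values, 4);
    assert_eq!(sequential, parallel);
    println!("{} distinct values", parallel.len());
}
```

Whether this is any faster than the sequential version depends on how expensive the per-item work is compared with the merge, so it should be measured rather than assumed.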

If you replace your computation with a sleep, the results are as you expect:

use std::collections::HashSet;
use rayon::prelude::*; // 1.1.0
use std::time::Duration;

fn main() {
    fn slow(i: &i64) -> i64 {
        std::thread::sleep(Duration::from_millis(5));

        *i
    }

    let my_vec: Vec<i64> = (0..100).collect();

    let mut my_set: HashSet<i64> = HashSet::new();

    let st = std::time::Instant::now();
    my_set.extend(
        my_vec.iter().map(slow)  // this is supposed to take a while to compute
    );
    let dur = st.elapsed();
    println!("Regular: {:?}", dur);

    let st = std::time::Instant::now();
    my_set.par_extend(
        my_vec.par_iter().map(slow) // this is supposed to take a while to compute
    );
    let dur = st.elapsed();
    println!("Rayon: {:?}", dur);
}

Output:

Regular: 685.670791ms
Rayon: 316.733253ms

When you optimize your code, benchmark it carefully: parallelizing can sometimes make it slower.
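
As a rough illustration of what careful benchmarking can look like without extra crates, the sketch below repeats a measurement several times and keeps the fastest run, which reduces noise from other processes. `best_of` is a hypothetical helper, not part of rayon or the standard library, and the run count of 5 is an arbitrary assumption.

```rust
use std::collections::HashSet;
use std::hint::black_box;
use std::time::{Duration, Instant};

// Hypothetical helper: call `f` `runs` times and return the fastest
// wall-clock time observed. Returns Duration::MAX if `runs` is 0.
fn best_of<T>(runs: u32, mut f: impl FnMut() -> T) -> Duration {
    let mut best = Duration::MAX;
    for _ in 0..runs {
        let start = Instant::now();
        // black_box discourages the optimizer from discarding the work.
        let result = black_box(f());
        let elapsed = start.elapsed();
        drop(result); // dropping the result is not part of the timing
        best = best.min(elapsed);
    }
    best
}

fn main() {
    let values: Vec<i64> = (0..1_000_000).collect();

    // Time the sequential version of the computation from the question.
    let sequential = best_of(5, || {
        values
            .iter()
            .map(|&x| x * (x + 3) / 77 + 44741)
            .collect::<HashSet<i64>>()
    });
    println!("sequential, best of 5: {:?}", sequential);
}
```

The same helper can wrap the parallel variant so that both are measured the same way. Timings are only meaningful in an optimized build, for example with `cargo run --release`.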

Boiethios