
I am trying to parallelize my code using the rayon crate. The process is: read a file, process it, and write the processed file out.

I want to record the result of processing each file, so I have an `Arc<Mutex<Vec<anyhow::Result<()>>>>` that I lock, pushing the `anyhow::Result<()>` produced by each file's processing.

fn main() -> anyhow::Result<()> {
    let (mut files, _) = utils::get_files_from_folder(input_folder)?;

    let results = Arc::new(Mutex::new(Vec::<anyhow::Result<()>>::new()));

    files.par_iter_mut().for_each(|path| {
        if let Some(extension) = path.extension() {
            if extension == "txt" {
                let result = redact::redact_txt_and_write_json(path, &regex_vec, &output_folder); // processing done here
                results.lock().expect("`results` cannot be locked").push(result); // lock the mutex and push here
            } else {
                eprintln!(
                    "{}INVALID EXTENSION: {} - Not yet implemented",
                    *RED_ERROR_STRING,
                    extension.to_string_lossy(),
                );
                std::process::exit(1);
            }
        } else {
            eprintln!("{}EXTENSION not found", *RED_ERROR_STRING);
            std::process::exit(1);
        }
    }); // end of for_each

    println!("{:?}", results.as_ref());
    Ok(())
}

My question is: why does it apparently take longer with locking than without?

With locking:

Finished dev [unoptimized + debuginfo] target(s) in 1m 34s

Without locking:

Finished dev [unoptimized + debuginfo] target(s) in 0.30s
Jim
  • What do you mean with *"without locking"*? Please provide code for both cases. – Finomnis Mar 24 '23 at 09:24
  • When multiple threads are trying to access a mutex at the same time, they have to "wait" until it gets unlocked to be able to take over. Sometimes mutex isn't the best solution to use across threads. – kemicofa ghost Mar 24 '23 at 09:30
  • Please provide [MRE]s, especially for timing claims. Your code does not even compile; you can't return an `Ok(())` from a `main()` without return value. – Finomnis Mar 24 '23 at 09:31

1 Answer


Your minimal example is quite convoluted and not reproducible, so I rewrote it to demonstrate the problem you are having:

use std::{
    sync::{Arc, Mutex},
    time::Instant,
};

use rayon::prelude::{IntoParallelRefIterator, ParallelIterator};

struct SomeResult(u32);

fn do_some_processing(arg: u32) -> SomeResult {
    SomeResult(arg * 2)
}

fn main() {
    let input_data = (1..10000000).collect::<Vec<_>>();

    let t_start = Instant::now();
    let results = Arc::new(Mutex::new(Vec::<SomeResult>::new()));

    input_data.par_iter().for_each(|value| {
        let result = do_some_processing(*value);
        results
            .lock()
            .expect("`results` cannot be locked")
            .push(result); // lock the mutex and push done here
    });
    let t_end = Instant::now();

    println!("Result item count: {}", results.lock().unwrap().len());
    println!("Time taken: {} ms", (t_end - t_start).as_millis());
}
Result item count: 9999999
Time taken: 116 ms

A `Mutex` is not the optimal data structure here. Locking means that the parallelism you gained through rayon gets serialized again at the point where each result is pushed into the `Vec`, which becomes a major bottleneck. Only one thread can `push` at a time; all the others wait inside `.lock()` until that thread finishes and the next one is allowed in. Essentially, the threads are constantly waiting for each other.

rayon has a solution for this use case: its parallel iterators support `.collect()`, which gathers every item into a vector just like the manual pushes, but without the lock and in a highly parallelized way.

With `.collect()`, the time in my example drops from 116 ms to 3 ms.

use std::time::Instant;

use rayon::prelude::{IntoParallelRefIterator, ParallelIterator};

struct SomeResult(u32);

fn do_some_processing(arg: u32) -> SomeResult {
    SomeResult(arg * 2)
}

fn main() {
    let input_data = (1..10000000).collect::<Vec<_>>();

    let t_start = Instant::now();
    let results: Vec<SomeResult> = input_data
        .par_iter()
        .map(|value| do_some_processing(*value))
        .collect();
    let t_end = Instant::now();

    println!("Result item count: {}", results.len());
    println!("Time taken: {} ms", (t_end - t_start).as_millis());
}
Result item count: 9999999
Time taken: 3 ms
Finomnis
  • I sincerely apologize for my lack of reproducible code in this question. I'll do better next time. Thank you for taking the time to write a minimal example that absolutely captures the essence of my question; your answer completely resolves it! Thank you! – Jim Mar 24 '23 at 09:52
  • So I should not be using `for_each` but instead `.map()`, so that I can return an `anyhow::Result` from the closure and `.collect()` into `results`. Yeah, I guess the limitation of `for_each` (i.e. the closure must return the unit type) forced me into the convoluted `Arc<Mutex<Vec<anyhow::Result<()>>>>` initially. – Jim Mar 24 '23 at 09:56
  • @Jim Exactly. `.map()` is the same as `.for_each()`, just with the difference that it can propagate a value out from each iteration. – Finomnis Mar 24 '23 at 09:56
  • Note that even for the `Mutex` version, the `Arc` is not needed: [playground](https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=bae05a01b0edd12543755853757928b6). – Jmb Mar 24 '23 at 09:58
  • @Jmb You are absolutely right. Although the `Arc` is not the bottleneck in this case, so I didn't bother. – Finomnis Mar 24 '23 at 09:59
  • @Jmb @Finomnis Thanks both. Why is `Arc` not needed? Is it because `rayon` does not do multithreading so atomic reference counting is not needed here? – Jim Mar 24 '23 at 10:05
  • @Jim It's because you can pass non-mut references repeatedly into nested scopes, as long as those scopes are known to end before the original variable is destroyed. `Arc` is only needed if the place you pass it to has to actively keep the variable alive; in the `rayon` case, the compiler can prove that the variable stays alive until the end of `main()`, and `rayon`'s threads are definitely finished before the end of `main()`. (That's at least the high-level view, the low level view of how lifetimes work exactly is a little more nuanced) – Finomnis Mar 24 '23 at 10:08
  • The reason I talk about references here is because there is no `move` keyword on your closure, so everything it references from the outside is automatically done by reference. Otherwise you would have to `.clone()` your `Arc`, which you don't do; `Arc`s are pretty much pointless if not at least two of them exist for the same object. Otherwise you can simply use references. – Finomnis Mar 24 '23 at 10:11
  • Thank you for your explanation @Finomnis; just a short follow-up question: I happened to need `par_iter_mut`, which sends a `&mut value` to each thread. I guess in this case an `Arc` is still not necessary because `.collect()` will 'ensure' all references are dropped before `main()` ends? – Jim Mar 24 '23 at 10:14
  • @Jim I think you mix up concepts here - you never needed `Arc` for the **argument** of the closure. `par_iter_mut` only has an effect on the argument, not on the variables that the closure references from the outside directly. Whether or not you use `.collect()` also has no effect on the `Arc` thing - but if you use `.collect()` you don't need to reference the outside `Vec` in the first place, so I'm not sure what you are trying to say here. – Finomnis Mar 24 '23 at 10:15
  • Now I get it: because I am not moving the `Mutex` into each thread, I am only passing a reference to the `Mutex` into each thread, and each thread is guaranteed to finish before `main()` ends, hence there is no need for atomic reference counting to keep `results` alive for threads of arbitrary lifetime. – Jim Mar 24 '23 at 10:21
  • @Jim Exactly. That's it. – Finomnis Mar 24 '23 at 10:21
  • @Finomnis Sorry. Just want to be sure: all threads are dropped before `main()` is dropped because of `.collect()`? – Jim Mar 24 '23 at 10:23
  • @Jim No, it's because that's how `rayon` works. `rayon`'s threads are attached to the `par_iter` object, which gets dropped at the end of `main()`. `.collect()` just moves that point to an earlier time, because `.collect()` consumes that `par_iter` object, destroying it in the process. – Finomnis Mar 24 '23 at 10:24
  • @Finomnis, now I fully get it; thank you! – Jim Mar 24 '23 at 10:26
  • @Jim The only reason it **doesn't** work with normal [`std::thread::spawn()`](https://doc.rust-lang.org/std/thread/fn.spawn.html) is because the [`JoinHandle`](https://doc.rust-lang.org/std/thread/struct.JoinHandle.html) objects they produce do **not** actually destroy the thread if they are dropped, so the thread could outlive `main()`. That said, there is [`std::thread::scope()`](https://doc.rust-lang.org/std/thread/fn.scope.html) which allows you to spawn scoped threads that behave the same as the `rayon` versions. – Finomnis Mar 24 '23 at 10:28
  • @Finomnis Thank you for your follow up explanation! – Jim Mar 24 '23 at 10:47
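The `std::thread::scope()` comparison in the last comments can be sketched with the standard library alone: scoped threads are joined before the scope returns, so a plain reference to a stack-allocated `Mutex` suffices and no `Arc` is needed. The two hard-coded chunks below are an arbitrary choice for illustration:

```rust
use std::sync::Mutex;
use std::thread;

fn main() {
    // No Arc: the Mutex lives on main's stack and is only borrowed.
    let results = Mutex::new(Vec::<u32>::new());

    // Scoped threads are guaranteed to finish before `scope` returns,
    // so plain references into the enclosing stack frame are allowed.
    thread::scope(|s| {
        let results = &results;
        for chunk in [1u32..=5, 6..=10] {
            s.spawn(move || {
                for v in chunk {
                    results.lock().unwrap().push(v * 2);
                }
            });
        }
    });

    // All threads have joined here, so we can take the Vec back out.
    let mut values = results.into_inner().unwrap();
    values.sort();
    assert_eq!(values, (1u32..=10).map(|v| v * 2).collect::<Vec<_>>());
    println!("{values:?}");
}
```

This mirrors what rayon does internally for `par_iter()`: the worker tasks are guaranteed to complete before the iteration returns, which is why the compiler accepts plain references from the closures.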