
I'm quite new to Rust and am working through toy problems. I'm trying to write a directory traversal that is parallelized using only Rayon.

struct Node {
    path: PathBuf,
    files: Vec<PathBuf>,
    hashes: Vec<String>,
    folders: Vec<Box<Node>>,
}

impl Node {
    pub fn new(path: PathBuf) -> Self {
        Node {
            path: path,
            files: Vec::new(),
            hashes: Vec::new(),
            folders: Vec::new(),
        }
    }
    
    pub fn burrow(&mut self) {
        let mut contents: Vec<PathBuf> = ls_dir(&self.path);

        contents.par_iter().for_each(|item| 
                                if item.is_file() {
                                    self.files.push(*item);
                                } else if item.is_dir() {
                                    let mut new_folder = Node::new(*item);
                                    new_folder.burrow();
                                    self.folders.push(Box::new(new_folder));
                                });
        
    }
}

The errors I am receiving are

error[E0596]: cannot borrow `*self.files` as mutable, as it is a captured variable in a `Fn` closure
  --> src/main.rs:40:37
   |
40 | ...                   self.files.push(*item);
   |                       ^^^^^^^^^^^^^^^^^^^^^^ cannot borrow as mutable

error[E0507]: cannot move out of `*item` which is behind a shared reference
  --> src/main.rs:40:53
   |
40 | ...                   self.files.push(*item);
   |                                       ^^^^^ move occurs because `*item` has type `PathBuf`, which does not implement the `Copy` trait

error[E0507]: cannot move out of `*item` which is behind a shared reference
  --> src/main.rs:42:68
   |
42 | ...                   let mut new_folder = Node::new(*item);
   |                                                      ^^^^^ move occurs because `*item` has type `PathBuf`, which does not implement the `Copy` trait

error[E0596]: cannot borrow `*self.folders` as mutable, as it is a captured variable in a `Fn` closure
  --> src/main.rs:44:37
   |
44 | ...                   self.folders.push(Box::new(new_folder));
   |                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ cannot borrow as mutable

The errors are clear in that they are preventing different threads from accessing mutable memory, but I'm just not sure how to start to address the errors.

Below is the original (non-parallel) version of burrow

pub fn burrow(&mut self) {
    let mut contents: Vec<PathBuf> = ls_dir(&self.path);

    for item in contents {
        if item.is_file() {
            self.files.push(item);
        } else if item.is_dir() {
            let mut new_folder = Node::new(item);
            new_folder.burrow();
            self.folders.push(Box::new(new_folder));
        }
    }
}

2 Answers


The best option in this case is to use ParallelIterator::partition_map() which allows you to turn a parallel iterator into two different collections according to some condition, which is exactly what you need to do.

Example program:

use rayon::iter::{Either, IntoParallelIterator, ParallelIterator};

fn main() {
    let input = vec!["a", "bb", "c", "dd"];

    let (chars, strings): (Vec<char>, Vec<&str>) =
        input.into_par_iter().partition_map(|s| {
            if s.len() == 1 {
                Either::Left(s.chars().next().unwrap())
            } else {
                Either::Right(s)
            }
        });

    dbg!(chars, strings);
}

If you had three different outputs, unfortunately Rayon does not support that. I haven't looked at whether it'd be possible to build using Rayon's traits, but what I would suggest as a more general (though not quite as efficient) solution is to use channels. A channel like std::sync::mpsc allows any number of threads to insert items while another thread removes them — in your case, to move them into a collection. This would not be quite as efficient as parallel collection, but in an IO-dominated problem like yours, it would not be significant.
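To make the channel idea concrete, here is a minimal sketch that uses only the standard library (`std::thread::scope` and `std::sync::mpsc`) rather than Rayon, so the threading is explicit. The `Sorted` struct, the `Kind` enum, the `classify` function, and the chunk size of 4 are all made up for illustration. Worker threads send tagged paths into the channel, and the calling thread drains it into three collections.

```rust
use std::path::PathBuf;
use std::sync::mpsc;
use std::thread;

/// Hypothetical three-way result; the names are illustrative only.
#[derive(Debug, Default)]
struct Sorted {
    files: Vec<PathBuf>,
    folders: Vec<PathBuf>,
    other: Vec<PathBuf>,
}

/// Tag sent through the channel alongside each path.
enum Kind {
    File,
    Folder,
    Other,
}

fn classify(paths: &[PathBuf]) -> Sorted {
    let (tx, rx) = mpsc::channel::<(Kind, PathBuf)>();
    let mut sorted = Sorted::default();

    thread::scope(|scope| {
        // One producer thread per chunk of paths (chunk size is arbitrary).
        for chunk in paths.chunks(4) {
            let tx = tx.clone();
            scope.spawn(move || {
                for path in chunk {
                    let kind = if path.is_file() {
                        Kind::File
                    } else if path.is_dir() {
                        Kind::Folder
                    } else {
                        Kind::Other
                    };
                    tx.send((kind, path.clone())).expect("receiver dropped early");
                }
            });
        }
        // Drop the original sender so the loop below ends once every
        // producer has finished and dropped its clone.
        drop(tx);

        // The calling thread is the single consumer.
        for (kind, path) in rx {
            match kind {
                Kind::File => sorted.files.push(path),
                Kind::Folder => sorted.folders.push(path),
                Kind::Other => sorted.other.push(path),
            }
        }
    });

    sorted
}
```

Items arrive in whatever order the threads happen to send them, so sort the resulting vectors if a stable order matters. With Rayon you could keep the parallel iterator and give each worker a clone of the sender instead of spawning threads by hand.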

Kevin Reid
  • Kevin, I really appreciate the time you spent to give that answer. Since reading it, I have been looking into the documentation for `partition_map()`. It returns `(A, B)`, where `A: Default + Send + ParallelExtend, B: Default + Send + ParallelExtend`. Initially, the recursive part of my code, `burrow()`, was changing the `Node` struct in place (that's the best way I can think to say it) through each call to `burrow`. However, since `partition_map` returns a tuple, I can vaguely see how to refactor `burrow` to account for this, but not entirely. Do you have any refactoring suggestions? – Quin Darcy Oct 19 '22 at 04:18
  • @QuinDarcy You should be able to modify it in the same way you already were, since it's still a local variable at that point. That is: `{ let mut new_folder = Node::new(*item); new_folder.burrow(); Either::Right(Box::new(new_folder)) }` – Kevin Reid Oct 19 '22 at 04:27
  • Ah hah! Could I capture the return of `partition_map`, in a variable and then check if its `Left` contains `Some(file)` or its `Right` contains `Some(folder)` and `.push()` that into `self.files` or `self.folders` respectively? – Quin Darcy Oct 19 '22 at 04:47
  • @QuinDarcy No, the idea is that `.partition_map()` gives you two *new* vectors which you can then store wherever you want. The `Either` is gone by then and you don't need to worry about it. What you describe is what you would have to do if you did a `.map().collect()` instead of using `.partition_map()` – Kevin Reid Oct 19 '22 at 15:18
  • Jeez, I hate to keep bugging you as you have already been very generous with your time, but I think I found an issue with being able to implement `.partition_map()`. The return of `.partition_map()` is a pair of `ParallelExtend` containers. However, `ParallelExtend` doesn't have an implementation on `PathBuf` or `Box`, which are the types of `item` and `new_folder`, respectively, in the example. – Quin Darcy Oct 21 '22 at 05:47
  • @QuinDarcy That's okay. It's implemented for `Vec` — the type of the *container* you want to collect your sets into. See my example code. The output of `partition_map` is the two containers, which in your case are `files` and `folders`. – Kevin Reid Oct 21 '22 at 14:25
  • Alright, after trying it out today, it seems that `ParallelExtend` does have an implementation on `Vec<T>`, but it throws an error when `T` is anything but a primitive. In my case I have a `Vec<PathBuf>` and a `Vec<Box<Node>>`. At this point, I am considering refactoring the `Node` struct to make it work with `partition_map()`, though I don't know how to get around having to use `Box`. Yet again, I am in need of your guidance and to REID your reply! – Quin Darcy Oct 25 '22 at 04:49
  • @QuinDarcy There is no such restriction on `ParallelExtend`; you should be able to make a collection from any type. I suggest you post a new question with the code and error — comment threads are not really fit for this. – Kevin Reid Oct 25 '22 at 14:33
  • Understood. I have opened a new question continuing our discussion above. Again, thank you very much for your time and help! – Quin Darcy Oct 26 '22 at 02:03

I'm going to skip the separation of files and folders, ignore the structure, and demonstrate a simple recursive approach that gets all the files in a directory recursively:

use std::path::{Path, PathBuf};

fn burrow(dir: &Path) -> Vec<PathBuf> {
    let mut contents = vec![];

    for entry in std::fs::read_dir(dir).unwrap() {
        let entry = entry.unwrap().path();
        if entry.is_dir() {
            contents.extend(burrow(&entry));
        } else {
            contents.push(entry);
        }
    }

    contents
}

If you want to use the parallel iterators from rayon, the first step is to convert this loop into a non-parallel iterator chain. The best way to do that is with .flat_map(), which flattens each closure result (here a Vec that may hold more than one element) into the output:

fn burrow(dir: &Path) -> Vec<PathBuf> {
    std::fs::read_dir(dir)
        .unwrap()
        .flat_map(|entry| {
            let entry = entry.unwrap().path();
            if entry.is_dir() {
                burrow(&entry)
            } else {
                vec![entry] // use a single-element Vec if not a directory
            }
        })
        .collect()
}

Then, to have rayon process this iteration in parallel, use .par_bridge() to convert the regular iterator into a parallel iterator. That's actually all there is to it:

use rayon::iter::{ParallelBridge, ParallelIterator};
use std::path::{Path, PathBuf};

fn burrow(dir: &Path) -> Vec<PathBuf> {
    std::fs::read_dir(dir)
        .unwrap()
        .par_bridge()
        .flat_map(|entry| {
            let entry = entry.unwrap().path();
            if entry.is_dir() {
                burrow(&entry)
            } else {
                vec![entry]
            }
        })
        .collect()
}

See it working on the playground. You can extend this to collect more complex results (such as folders, hashes, and whatever else you need).

kmdreko