
I'd like to improve the integration of my async data collection with my rayon data processing by overlapping the retrieval and the processing. Currently, I pull lots of pages from a website using normal async code. Once that is complete, I do the CPU-intensive work using rayon's par_iter.

It seems like I should be able to easily overlap the processing, so that I'm not waiting for every last page before I begin the grunt work. Every page that I retrieve is independent of the others, so there is no need to wait before the conversion.

Here's what I have working currently (simplified just a bit):

use rayon::prelude::*;
use futures::{stream, StreamExt};
use reqwest::{Client, Result};


const CONCURRENT_REQUESTS: usize = usize::MAX; // effectively no cap on in-flight requests
const MAX_PAGE: usize = 1000;

#[tokio::main]
async fn main() {
    // get data from server
    let client = Client::new();
    let bodies: Vec<Result<String>> = stream::iter(1..=MAX_PAGE)
        .map(|page_number| {
            let client = &client;
            async move {
                client
                    .get(format!("https://someurl?{page_number}"))
                    .send()
                    .await?
                    .text()
                    .await
            }
        })
        .buffer_unordered(CONCURRENT_REQUESTS)
        .collect()
        .await;

    // transform the data (the MyPage and MyRow definitions are elided here)
    let mut rows: Vec<MyRow> = bodies
        .par_iter()
        .filter_map(|body| body.as_ref().ok())
        .map(|data| {
            let page = serde_json::from_str::<MyPage>(data).unwrap();
            page.rows
                .iter()
                .map(|x| MyRow::new(x))
                .collect::<Vec<MyRow>>()
        })
        .flatten()
        .collect();

    // do something with rows
}
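
For reference, here is the kind of overlap I'm imagining: a rough, untested sketch that pushes each body into a channel as soon as it arrives and lets a rayon consumer drain it via par_bridge. The MyPage/MyRow definitions below are placeholder stand-ins for my real (elided) types, and the bounded CONCURRENT_REQUESTS value is just for the sketch:

use std::sync::mpsc;

use futures::{stream, StreamExt};
use rayon::prelude::*; // includes ParallelBridge for par_bridge()
use reqwest::Client;

const CONCURRENT_REQUESTS: usize = 64; // bounded just for this sketch
const MAX_PAGE: usize = 1000;

// Placeholder stand-ins for the real (elided) types.
#[derive(serde::Deserialize)] // needs serde's "derive" feature
struct MyPage {
    rows: Vec<serde_json::Value>,
}

struct MyRow;

impl MyRow {
    fn new(_raw: &serde_json::Value) -> Self {
        MyRow
    }
}

fn parse_rows(body: &str) -> Vec<MyRow> {
    let page: MyPage = serde_json::from_str(body).unwrap();
    page.rows.iter().map(MyRow::new).collect()
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel::<String>();

    // CPU side: a plain thread drains the channel (recv blocks, so it must
    // stay off the tokio runtime) and fans the parsing out onto rayon's pool.
    let cpu = std::thread::spawn(move || {
        rx.into_iter()
            .par_bridge()
            .flat_map_iter(|body| parse_rows(&body))
            .collect::<Vec<MyRow>>()
    });

    // I/O side: the same fetch loop as above, except each body is sent
    // into the channel on completion instead of being collected.
    let client = Client::new();
    stream::iter(1..=MAX_PAGE)
        .map(|page_number| {
            let client = &client;
            let tx = tx.clone();
            async move {
                let body = client
                    .get(format!("https://someurl?{page_number}"))
                    .send()
                    .await?
                    .text()
                    .await?;
                let _ = tx.send(body); // fails only if the consumer is gone
                Ok::<(), reqwest::Error>(())
            }
        })
        .buffer_unordered(CONCURRENT_REQUESTS)
        .for_each(|res| async move {
            if let Err(e) = res {
                eprintln!("request failed: {e}");
            }
        })
        .await;

    drop(tx); // close the channel so the rayon side can finish
    let rows = cpu.join().unwrap();
    // do something with rows
    println!("{} rows", rows.len());
}

The plain thread exists because the channel recv blocks; rayon then bounds the CPU side to the core count while buffer_unordered bounds the I/O side independently.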
  • Does something like this suit your needs? [playground](https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=746b3727144799586f9a2c7c1952b8d7) (a sketch along these lines appears after this comment thread) – Bamontan Aug 31 '22 at 20:33
  • It sort of does, and I had considered something like this. But it seems like there is a mismatch between the key tasks. Ideally, the I/O should be completely asynchronous with as many simultaneous requests going to the server as possible, while the processing should use the threads and cores provided by my CPU. It seems like with your approach, either the I/O or the processing is going to be hampered. – ahenshaw Sep 01 '22 at 00:54
  • I don't really know what you are doing in the "transform the data" part because I don't have your code, but compared to waiting for the responses to finish it is probably 100x faster. Even with this approach you will just be waiting for I/O anyway. So instead of waiting a lot and then doing heavy work, you are just waiting a lot. I think you underestimate how slow a request is compared to just crunching some numbers. My take is that you are trying to optimize something that takes 1% of the total time. – Bamontan Sep 01 '22 at 06:36
  • That's fair and upon reflection, you are probably right. I will rework using your approach and check the numbers. Thanks! – ahenshaw Sep 01 '22 at 15:53
  • In a way I kind of want to be right, haha, but maybe I'm wrong. Do some benchmarks if you can and tell me the results; I'm interested. For more accurate numbers, see if you can make a local server that acts like the server you are trying to access, so some I/O will be present but more consistent (see the timing sketch below). – Bamontan Sep 01 '22 at 17:56
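
The playground linked above isn't reproduced here, but a pattern along the lines these comments discuss is to do the parse per response on a blocking thread, so parsing overlaps with the requests still in flight. A sketch, reusing the placeholder parse_rows, MyRow, and constants from the sketch above (this is one possible reading, not the playground's actual code):

// Drop-in replacement for both phases inside the same async main.
let client = Client::new();
let rows: Vec<MyRow> = stream::iter(1..=MAX_PAGE)
    .map(|page_number| {
        let client = &client;
        async move {
            let body = client
                .get(format!("https://someurl?{page_number}"))
                .send()
                .await?
                .text()
                .await?;
            // Hand the CPU-bound parse to tokio's blocking pool so it
            // runs while other requests are still in flight.
            let rows = tokio::task::spawn_blocking(move || parse_rows(&body))
                .await
                .expect("parse task panicked");
            Ok::<_, reqwest::Error>(rows)
        }
    })
    .buffer_unordered(CONCURRENT_REQUESTS)
    .filter_map(|res| async move { res.ok() })
    .concat()
    .await;

The caveat, matching the mismatch concern raised above: spawn_blocking runs on tokio's blocking pool, which is sized for blocking I/O (up to 512 threads by default) rather than for CPU-bound work, so for genuinely heavy parsing the channel-plus-rayon handoff keeps the CPU side bounded to the core count instead.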
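And a minimal way to check the "1% of the total time" estimate is to time the two phases separately. fetch_all and transform below are hypothetical helper names standing in for the two phases of the original code:

use std::time::Instant;

// Inside the original main, with the two phases factored into helpers:
let t0 = Instant::now();
let bodies = fetch_all(&client).await; // the buffer_unordered loop above
let fetch_time = t0.elapsed();

let t1 = Instant::now();
let rows = transform(&bodies); // the par_iter transform above
let parse_time = t1.elapsed();

eprintln!("fetch: {fetch_time:?}  parse: {parse_time:?}");
// Overlapping saves at most min(fetch_time, parse_time), so if parse_time
// is a small fraction of fetch_time, the restructuring buys little.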

0 Answers