
I'm making my web server in Rust using warp and tokio. Here's what I'm doing:

  • I'm creating three tokio runtimes and executing three async functions on them, all of which communicate with each other using channels. I'm doing this because I'm building a web server for model inference, and each part is responsible for one thing (preprocessing and batching, inference from a TensorFlow model, and the server itself).
  • The response handler for each request passes the data it receives to a function running on another runtime through an mpsc transmitter, a clone of which is passed to all handlers. It also passes a oneshot Sender for the other runtime to send the results back to the response handler.

This works fine under moderate load, but under heavy load (50 threads, 100 loops each), the oneshot receiver in the response handler seems to be dropped, and the server is unable to return the result.
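
Roughly, the load has this shape; the following is only a sketch of the kind of client I mean (not the exact one), assuming the reqwest crate with its blocking feature enabled:

// Hypothetical load generator, not the actual client: 50 OS threads,
// each sending 100 GET requests to the /imageId/<id> endpoint.
// Assumes reqwest = { version = "0.11", features = ["blocking"] }.
use std::thread;

fn main() {
    let mut handles = Vec::new();
    for t in 0..50 {
        handles.push(thread::spawn(move || {
            let client = reqwest::blocking::Client::new();
            for i in 0..100 {
                let url = format!("http://127.0.0.1:3030/imageId/{}-{}", t, i);
                match client.get(url).send() {
                    Ok(resp) => {
                        // Drain the body so the connection can be reused.
                        let _ = resp.text();
                    }
                    Err(e) => eprintln!("request failed: {}", e),
                }
            }
        }));
    }
    for handle in handles {
        let _ = handle.join();
    }
}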

I've attached a minimal reproducible example below (it collapses this into two runtimes, the server and the inference loop):

#![allow(non_snake_case)]

#[macro_use]
extern crate log;
extern crate chrono;

use crossbeam_channel::{unbounded, Receiver, Sender};
use serde::{Deserialize, Serialize};
use std::time::Duration;
use tokio::runtime::Runtime;
use tokio::sync::oneshot;
use warp::{Filter, Rejection, Reply};

#[derive(Debug, Clone, Deserialize, Serialize)]
struct ServerResponse {
    message: String,
}
impl warp::Reply for ServerResponse {
    fn into_response(self) -> warp::reply::Response {
        warp::reply::json(&self).into_response()
    }
}

#[derive(Debug)]
struct HandlerData {
    image_id: String,
    send_results_oneshot: oneshot::Sender<ServerResponse>,
}

fn main() {
    env_logger::init();
    let (tx_batch_data, rx_batch_data) = unbounded::<Vec<HandlerData>>();

    let server_thread = std::thread::spawn(move || match Runtime::new() {
        Ok(rt) => {
            rt.block_on(server(tx_batch_data));
        }
        Err(err) => error!("Error initializing runtime for server : {}", err),
    });
    let inference_and_cleanup_thread = std::thread::spawn(move || match Runtime::new() {
        Ok(rt) => {
            rt.block_on(inference_and_cleanup(&rx_batch_data));
        }
        Err(err) => error!("Error initializing runtime for inference : {}", err),
    });
    let _ = server_thread.join();
    let _ = inference_and_cleanup_thread.join();
}

async fn server(tx_handler_data: Sender<Vec<HandlerData>>) {
    let endpoint = warp::path!("imageId" / String)
        .and(warp::any().map(move || tx_handler_data.clone()))
        .and(warp::any().map(move || oneshot::channel::<ServerResponse>()))
        .and_then(response_handler);
    warp::serve(endpoint).run(([0, 0, 0, 0], 3030)).await;
}

async fn response_handler(
    image_id: String,
    tx_handler_data: Sender<Vec<HandlerData>>,
    (send_results_oneshot, get_results): (
        oneshot::Sender<ServerResponse>,
        oneshot::Receiver<ServerResponse>,
    ),
) -> Result<impl Reply, Rejection> {
    // the oneshot sender and receiver used to get the results back are created in the filter above
    // send the data to the batch-and-preprocess task
    tx_handler_data
        .send(vec![HandlerData {
            image_id: image_id.clone(),
            send_results_oneshot,
        }])
        .unwrap_or_else(|e| {
            error!(
                "Error while sending the data from response handler! : {}",
                e
            )
        });
    let result: ServerResponse = get_results.await.unwrap_or_else(|e| {
        error!(
            "Error getting results from oneshot in response handler: {:?}",
            e
        );
        // dummy val
        ServerResponse {
            message: "from error handler".to_string(),
        }
    });
    Ok(result)
}

async fn inference_and_cleanup(rx_batch_data: &Receiver<Vec<HandlerData>>) {
    loop {
        let batch_received: Option<Vec<HandlerData>> = rx_batch_data.try_recv().ok();
        if let Some(batch) = batch_received {
            info!("Got a batch of size {}", batch.len());
            tokio::time::sleep(Duration::from_millis(500)).await;
            for ele in batch {
                // tokio::time::delay_for(Duration::from_millis(10)).await;
                let oneshot = ele.send_results_oneshot;
                if !oneshot.is_closed() {
                    oneshot
                        .send(ServerResponse {
                            message: "worked successfully".to_string(),
                        })
                        .unwrap_or_else(|e| {
                            error!("Error while sending back results via oneshot : {:?}", e);
                        });
                } else {
                    error!("Didn't send anything, the oneshot receiver was closed");
                }
            }
        }
    }
}

Under load, I keep getting "Didn't send anything, the oneshot receiver was closed" in the logs.

What's going on? Is it something to do with the way it's architected, or with how warp is handling the requests? I'd appreciate any help.

Rohan Gautam
  • I can't reproduce the issue with your provided example; did you manage to reproduce it with the example server above? If so, can you provide the client code, too? I tried spamming requests from 10 clients in parallel but didn't get the described error. If I'm reading this correctly, your `inference_and_cleanup` task only yields control when an actual batch comes in; otherwise it's a busy wait. You might be better off using an async channel and `await`ing messages, and you might be able to parallelize the batch handling by spawning a task once a batch is received, too (see the sketch after these comments). – sebpuetz Mar 29 '21 at 10:29
  • Why would you need three different tokio runtimes? Tokio should be perfectly capable of handling a large number of tasks on a single runtime. – user4815162342 Mar 29 '21 at 12:50
  • I read somewhere that it's not guaranteed that two tokio tasks will run on separate threads/thread groups. My tasks need to be on separate threads, as I need them running in parallel. Is there any way to guarantee that with tasks? It was for that reason that I made two separate runtimes. – Rohan Gautam Apr 22 '21 at 06:52
  • Anyhow, the problem seems to have been solved. It has nothing to do with the oneshot: an external stakeholder had added a client timeout, which was terminating the requests earlier than we expected. Seems so silly in hindsight. Thanks for the help everyone! – Rohan Gautam Apr 22 '21 at 06:54
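
A minimal sketch of the approach suggested in the comments, assuming a single multi-threaded tokio runtime and tokio's async mpsc channel: batches are awaited instead of busy-polled, and each batch is handled in its own spawned task. The HandlerData and ServerResponse types are trimmed from the question and the warp server is omitted, so this is illustrative rather than a drop-in replacement.

use tokio::sync::{mpsc, oneshot};

#[derive(Debug)]
struct ServerResponse {
    message: String,
}

#[derive(Debug)]
struct HandlerData {
    image_id: String,
    send_results_oneshot: oneshot::Sender<ServerResponse>,
}

// #[tokio::main] gives a multi-threaded runtime by default, so spawned tasks
// already run in parallel across worker threads; no second runtime is needed.
#[tokio::main]
async fn main() {
    let (tx_batch_data, mut rx_batch_data) = mpsc::unbounded_channel::<Vec<HandlerData>>();

    // Inference loop: awaits batches instead of spinning on try_recv().
    let inference = tokio::spawn(async move {
        while let Some(batch) = rx_batch_data.recv().await {
            // Handle each batch concurrently in its own task.
            tokio::spawn(async move {
                for ele in batch {
                    let HandlerData { image_id, send_results_oneshot } = ele;
                    let _ = send_results_oneshot.send(ServerResponse {
                        message: format!("processed {}", image_id),
                    });
                }
            });
        }
    });

    // The warp server would be spawned on this same runtime and would hold a
    // clone of tx_batch_data; it is omitted here. Dropping the sender closes
    // the channel so this example terminates.
    drop(tx_batch_data);
    let _ = inference.await;
}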

0 Answers