I am currently trying to speed up a Python scientific code with the help of PyO3. Until now, numba's @jit has been doing wonders, but some routines are impossible to parallelize with it. I decided to turn to Rust and Rayon, which provide the tools to efficiently write a parallel version of said routine. However, using PyO3 to replace the JIT-compiled sequential function with the parallel Rust code has slowed down the simulation.
Since this Rust function is called repeatedly from the Python code, I was wondering whether the thread pool used by Rayon is spawned anew on each call, adding a large overhead to the function.
If that is the case, is there a way to keep the Rust runtime, with its pool, alive between calls from the Python code?
Here is the Rust function code:
use ndarray::Array1;
use rayon::prelude::*;

pub fn gather_n(positions: &[f64], n_cell: usize) -> Array1<f64> {
    let mut density = positions
        .par_iter()
        .fold(
            // each worker accumulates into its own local histogram
            || Array1::zeros(n_cell),
            |mut cells, x| {
                let i = x.floor() as usize;
                if i == n_cell {
                    cells[n_cell - 1] += 1.;
                } else {
                    // linear weighting between the two neighbouring cells
                    let d = x - i as f64;
                    cells[i] += 1. - d;
                    cells[i + 1] += d;
                }
                cells
            },
        )
        // sum the per-worker histograms into one
        .reduce(
            || Array1::zeros(n_cell),
            |mut a, b| {
                a += &b;
                a
            },
        );
    density[0] *= 2.;
    density[n_cell - 1] *= 2.;
    density
}
and the PyO3 wrapper:
#[pyfunction]
fn gather_n<'py>(py: Python<'py>, x: &PyArray1<f64>, n: usize) -> PyResult<&'py PyArray1<f64>> {
    // SAFETY: the array must not be resized or mutated from Python
    // while this borrowed slice is alive
    let x_slice = unsafe { x.as_slice()? };
    Ok(gather::gather_n(x_slice, n).into_pyarray(py))
}