I have a larger program that I can summarize as:
SequentialPart
ThreadPoolParallelized
SequentialPart
ParallelPartInQuestion
SequentialPart
This sequence gets called many times. I'm using Rayon to parallelize the second part (ParallelPartInQuestion) like so:
final_results = (0..num_txns)
    .into_par_iter()
    .filter_map(|idx| {
        if ret.is_some() {
            return None;
        }
        match last_input_output.take_output(idx) {
            ExecutionStatus::Success(t) | ExecutionStatus::SkipRest(t) => Some(t),
            ExecutionStatus::Abort(_err) => None,
        }
    })
    .collect();
I've also tried this with parallel chunks:
let interm_result: Vec<ExtrResult<E>> = (0..num_txns)
    .collect::<Vec<TxnIndex>>()
    .par_chunks(chunk_size)
    .map(|chunk| {
        // ... per-chunk work ...
    })
    .collect();
Either way, I noticed that the first time this code runs everything works as expected and I get a decent performance boost. On every subsequent iteration, however, the first parallel section (ThreadPoolParallelized) runs about 20% slower. So I concluded that Rayon must somehow leave something behind that has to be cleaned up afterwards, causing this performance drop.
Is there something I can do about this?
Edit: here is what take_output does:
outputs: Vec<CachePadded<ArcSwapOption<TxnOutput<T, E>>>>, // txn_idx -> output

pub fn take_output(&self, txn_idx: TxnIndex) -> ExecutionStatus<T, Error<E>> {
    let owning_ptr = self.outputs[txn_idx]
        .swap(None)
        .expect("Output must be recorded after execution");
    if let Ok(output) = Arc::try_unwrap(owning_ptr) {
        output
    } else {
        unreachable!("Output should be uniquely owned after execution")
    }
}
TL;DR: If you create a custom Rayon thread pool but at the same time also use the global thread pool through free-standing calls like par_chunks or par_iter, there is a performance overhead from cleaning up the global thread pool each time the custom one is invoked.