
I have a larger program that I can summarize like this:

SequentialPart
ThreadPoolParallelized
SequentialPart
ParallelPartInQuestion
SequentialPart

This whole sequence gets called many times.

I'm using Rayon to parallelize the second part, like so:

    final_results = (0..num_txns)
        .into_par_iter()
        .filter_map(|idx| {
            if ret.is_some() {
                return None;
            }
            match last_input_output.take_output(idx) {
                ExecutionStatus::Success(t) => Some(t),
                ExecutionStatus::SkipRest(t) => Some(t),
                ExecutionStatus::Abort(_) => None,
            }
        })
        .collect();

I've also already done this using parallel chunks:

    let interm_result: Vec<ExtrResult<E>> = (0..num_txns)
        .collect::<Vec<TxnIndex>>()
        .par_chunks(chunk_size)
        .map(|chunk| {
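
Completed into a self-contained form, the chunked variant looks roughly like the sketch below; TxnIndex, work_on_chunk, and the return type are simplified placeholders, not the real extraction logic.

    use rayon::prelude::*;

    type TxnIndex = usize;

    // Stand-in for the real per-chunk extraction work; yields one result per chunk.
    fn work_on_chunk(chunk: &[TxnIndex]) -> usize {
        chunk.len()
    }

    fn process_in_chunks(num_txns: TxnIndex, chunk_size: usize) -> Vec<usize> {
        (0..num_txns)
            .collect::<Vec<TxnIndex>>()        // materialize the indices
            .par_chunks(chunk_size)            // split them into chunks
            .map(|chunk| work_on_chunk(chunk)) // each chunk runs on a worker thread
            .collect()
    }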

Either way, I noticed that the first time this code runs, everything works as expected and I get a decent performance boost out of it.

However, on the second iteration the first parallel piece of code (ThreadPoolParallelized) runs around 20% slower every time.

So I concluded that Rayon must somehow leave something behind that has to be cleaned up afterwards, resulting in this performance drop.

Is there something I can do about this?

Edit: Here is what take_output does:

    outputs: Vec<CachePadded<ArcSwapOption<TxnOutput<T, E>>>>, // txn_idx -> output.

    pub fn take_output(&self, txn_idx: TxnIndex) -> ExecutionStatus<T, Error<E>> {
        let owning_ptr = self.outputs[txn_idx]
            .swap(None)
            .expect("Output must be recorded after execution");

        if let Ok(output) = Arc::try_unwrap(owning_ptr) {
            output
        } else {
            unreachable!("Output should be uniquely owned after execution");
        }
    }

TL;DR: If you create a custom Rayon threadpool and at the same time also use the global threadpool through calls like par_chunks or par_iter, there is a performance overhead from tearing down the global threadpool once the custom one is used.
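
For context, a minimal sketch of the problematic shape, assuming the real code follows the summary at the top; the pool size and the closures are placeholders.

    use rayon::prelude::*;
    use rayon::ThreadPoolBuilder;

    fn one_iteration(num_txns: usize) {
        // Custom pool used for the first parallel part (ThreadPoolParallelized).
        let custom_pool = ThreadPoolBuilder::new()
            .num_threads(8) // placeholder size
            .build()
            .expect("failed to build thread pool");

        custom_pool.install(|| {
            (0..num_txns).into_par_iter().for_each(|_idx| {
                // placeholder work
            });
        });

        // ParallelPartInQuestion: `into_par_iter` outside of `install` silently
        // uses rayon's implicit global pool, so two pools are alive side by side.
        let _results: Vec<usize> = (0..num_txns).into_par_iter().map(|idx| idx).collect();
    }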

  • What leads you to the conclusion that Rayon is to blame? Couldn't it be that your own code is simply doing more work the second/third/... time round? – Thomas Sep 27 '22 at 09:50
  • At this very moment, I'm executing the very exact workload every time (for benchmarking purposes) – raycons Sep 27 '22 at 09:51
  • Fair enough. What leads you to the conclusion that it's the Rayon part in particular that is affecting the performance of the threadpool part? – Thomas Sep 27 '22 at 09:54
  • If I process this sequentially without rayon the performance of the threadpool part remains constant. – raycons Sep 27 '22 at 09:56
  • Again fair enough. To the best of my knowledge, there's nothing about Rayon itself that would cause such behaviour, so it's likely something in your own code after all. If you can create a minimal, self-contained example that exhibits this behaviour, we might be able to help you further. – Thomas Sep 27 '22 at 10:01
  • Might it be related to the Arc unwrapping in a thread? – raycons Sep 27 '22 at 10:21
  • FWIW Rayon defaults to a global threadpool, which it keeps alive to amortise repeated usages. You might want to see if creating threadpools by hand (and collecting them afterwards) changes something. The mention of a 20% hit also brings up issue 968. – Masklinn Sep 27 '22 at 10:34
  • (it might be an OS behaviour around thread priorities or affinities as well e.g. that whatever your OS is, it defaults threads to high-priority, but if the thread is idle for a sufficiently long while its priority gets scaled back under the assumption that it's something like an IO thread) – Masklinn Sep 27 '22 at 10:37
  • @Masklinn is there any doc on using a manual threadpool with the parallel iterators? – raycons Sep 27 '22 at 11:43
  • @Masklinn making the loop re-use the existing threadpool solved it. If you want you can put that up as an answer. – raycons Sep 27 '22 at 12:08
  • Well I was thinking the opposite (that reusing threadpools was causing the issues) so I won't take credit just for suggesting looking at manual pools :) I would suggest creating a reproducible test case and opening an issue @ rayon so they can investigate though (ideally with more information e.g. your hardware configuration, OS, the works) – Masklinn Sep 27 '22 at 13:01
  • Are you able to formulate the question into a reproducible example? Right now the question seems fairly disconnected from the answer. It'd be more helpful for future rayon users I think. – kmdreko Sep 27 '22 at 15:02

1 Answer


I figured out what was causing the problem. The first parallel part in this execution used a manually created threadpool. However, into_par_iter uses the global threadpool if not otherwise specified and keeps it alive for some time, which interferes with the manually created threadpool:

    let interm_result: Vec<ExtrResult<E>> = RAYON_EXEC_POOL.install(|| {
        (0..num_txns)

By explicitly wrapping the code that should run in parallel in the pool.install call, it re-uses the same threadpool instead of creating an additional one that later has to be torn down with some overhead, and this preserves performance.
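
A self-contained sketch of that approach, assuming once_cell is used to hold the shared pool (the pool size and the per-index work are placeholders):

    use once_cell::sync::Lazy;
    use rayon::prelude::*;
    use rayon::{ThreadPool, ThreadPoolBuilder};

    // One pool for the whole program; every parallel section installs into it,
    // so no second (global) pool gets spun up alongside it.
    static RAYON_EXEC_POOL: Lazy<ThreadPool> = Lazy::new(|| {
        ThreadPoolBuilder::new()
            .num_threads(8) // placeholder size
            .build()
            .expect("failed to build rayon thread pool")
    });

    fn process(num_txns: usize) -> Vec<usize> {
        RAYON_EXEC_POOL.install(|| {
            // Parallel iterators created inside `install` run on RAYON_EXEC_POOL,
            // not on rayon's implicit global pool.
            (0..num_txns).into_par_iter().map(|idx| idx * 2).collect()
        })
    }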
