
I am currently attempting to speed up some computationally expensive parts of my Python code by rewriting them in Rust with PyO3, using Maturin to create a Python package. I noticed that the Rust implementation was more than 3 times slower than the implementation I had in Python using NumPy.

I do expect some overhead from moving data between the languages, but because of the large discrepancy in speed I wanted to determine whether the Rust code itself was also responsible. I implemented the same test case in pure Rust, then profiled both the Python-packaged and pure Rust cases using perf, and found something I don't understand.

The same calls to the internal function ran almost 20 times slower in the Python-packaged code than in the native Rust code!

This can be seen by comparing the profiler output for the Python-packaged code with that of the native Rust code. I would expect the internal Rust functions, at least, to run exactly the same in both scenarios, since surely PyO3 is nothing more than a wrapper around the same compiled Rust code.
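For context, the wrapping looks roughly like this (a minimal sketch, not my actual code; the function bodies are placeholders, and only the `parse`/`i_parse` structure mirrors the linked branch):

```rust
use pyo3::prelude::*;

// Plain Rust function that does the actual work. This is the code I
// would expect to run identically whether it is called from pure Rust
// or through the Python package.
fn i_parse(data: &[u8]) -> Vec<u64> {
    data.iter().map(|&b| b as u64).collect()
}

// Thin PyO3 wrapper: conversion happens only at the boundary, so the
// inner call should be the same compiled code in both scenarios.
#[pyfunction]
fn parse(data: Vec<u8>) -> PyResult<Vec<u64>> {
    Ok(i_parse(&data))
}

#[pymodule]
fn tpx3_toolkit(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(parse, m)?)?;
    Ok(())
}
```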

I do not know whether this is a mistake in how I am building the Rust package (i.e. Maturin not optimizing to the same extent cargo does), an expected issue, or something else entirely. Any insight into this is appreciated.

Testing details

All code mentioned here can be found in this testing branch of my GitHub project. The test cases correspond to running the following (from the root directory of the project):

Python packaged

% perf record -F9999 --call-graph dwarf python rust_test.py
% perf script -F +pid > test_py.perf   

Pure Rust

% cargo build --release 
% perf record -F9999 --call-graph dwarf target/release/tpx3_toolkit
% perf script -F +pid > test_rust.perf  

These *.perf files can be found in the benchmarks/rust directory of the GitHub project.

The Rust package was built in the typical way for Maturin: `maturin develop`.

Edit

It was brought to my attention by tadman in the comments that I was not necessarily comparing apples to apples with my timing comparisons, so I made some changes. I added internal timing to the parse function to compare it directly between the pure Rust and Python package implementations (a sketch of this timing is shown after the results below). I am also running the test with the largest file I currently have access to, in order to minimize the cross-language overhead and focus on the effect of running the code in the two different contexts. After doing so, I observed the following results:

Python package:

Time inside Rust for i_parse: 3.83s

Pure Rust:

Time inside Rust for i_parse: 213.97ms

I repeated this with another file around 1/20th the size of the one above:

Python package:

Time inside Rust for i_parse: 195.56ms

Pure Rust:

Time inside Rust for i_parse: 10.64ms

Thus, there is still a roughly consistent 15-20x scaling in runtime between the pure Rust and Python-packaged implementations.
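For reference, the internal timing is done roughly like this (again a sketch with a placeholder body for `i_parse`; the real code is in the linked branch):

```rust
use std::time::Instant;

// Hypothetical stand-in for the real parsing work in the linked branch.
fn i_parse(data: &[u8]) -> Vec<u64> {
    data.iter().map(|&b| b as u64).collect()
}

fn timed_parse(data: &[u8]) -> Vec<u64> {
    // Time only the internal Rust work, excluding any Python <-> Rust
    // conversion, so the number is comparable in both scenarios.
    let start = Instant::now();
    let result = i_parse(data);
    println!("Time inside Rust for i_parse: {:?}", start.elapsed());
    result
}
```

Since the timer starts and stops inside the Rust function, any Python-to-Rust argument conversion is excluded from the numbers above.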

  • Are you sure the build is optimized (e.g. `--release`)? It sounds like a debug build. – tadman May 02 '23 at 18:46
  • I did optimize the pure Rust build, but you're right that there is debug info in it still as I added `debug = true` under `[profile.release]` in `Cargo.toml` for the sake of getting `perf` to work right. – Brayden Freitas May 02 '23 at 18:54
  • Benchmarking a debug build is pretty pointless. It's not going to perform well. You'll need to compare release vs. release to get meaningful numbers. `perf` is good for identifying hot spots if you're trying to tune it better, but it's not good for testing actual (release) speed. – tadman May 02 '23 at 18:56
  • Either way, one would assume that removing the debug flag would make it better if anything, making this issue of the Python-packaged implementation being significantly slower even more dramatic. – Brayden Freitas May 02 '23 at 18:58
  • Make a test that performs some kind of operation that takes at least 2-5 minutes, and see how that works both direct and via Python, both release builds. I think your `perf` integration is in the way, as that might impose additional overhead on the Python side, or something else, in any case interfering and a variable to remove. – tadman May 02 '23 at 18:59
  • I made an edit to address your point here, which in theory I agree with. Unfortunately this change didn't mitigate the issue, as the effect is still observable with a similar magnitude. – Brayden Freitas May 02 '23 at 19:29
  • @tadman the [`debug`](https://doc.rust-lang.org/cargo/reference/profiles.html#debug) flag only controls whether or not debug info is included and shouldn't have a large impact on performance. – drewtato May 02 '23 at 19:45
  • You'll need more data points to see the root cause here. Is it scaling from zero, or is there some kind of intrinsic start-up time that penalizes Python? For example, make a run that takes at least 5-10s in Rust. If it takes 5.2 in Rust, but 8.8 in Python then it's a base overhead, but if it's still the same factor off then something deeper is wrong. It's hard to say from this data. – tadman May 02 '23 at 21:06
  • In order to do a test that long I will need to pull a file off of another machine tomorrow; however, I have tried this with the smaller files, which run in 1/10 the time, and the same scaling factor shows up rather than a constant offset. I do already know the constant overhead from the Python wrapping around the Rust, and I have mitigated that variable in this report by doing the timing in the internal Rust function (I also mitigated it originally by looking only at the internal function call in `perf`). – Brayden Freitas May 02 '23 at 21:28
  • `maturin develop` as in `maturin develop --release`? – Ry- May 02 '23 at 21:39
  • That was the issue! Thank you Ry-! I will post a response as an answer indicating this. – Brayden Freitas May 03 '23 at 14:45
