
I would like to ask some questions about the underlying principles of Python interpreters, because my own searching didn't turn up much useful information.

I've been using Rust to write Python extensions lately. This gives a significant speedup for Python's CPU-intensive tasks, and Rust is also faster to write than C. However, it has one disadvantage: compared to the older approach of accelerating with Cython, the call overhead of Rust (I'm using PyO3) seems to be greater than that of C (via Cython).

For example, here's an empty Python function:

def empty_function():
    return 0

Calling it a million times in a Python for loop and timing the total shows that each call takes about 70 nanoseconds (on my PC).
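The question doesn't show the timing harness; a minimal sketch with timeit that produces this kind of per-call measurement looks like:

```python
import timeit

def empty_function():
    return 0

# Run one million calls and report the average per-call time in nanoseconds.
n = 1_000_000
total = timeit.timeit("empty_function()", globals=globals(), number=n)
print(f"{total / n * 1e9:.0f} ns per call")
```

The same harness works for the Cython and Rust versions by swapping the setup to import the compiled module instead.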

And if we compile the same function as a Cython extension:

# test.pyx
cpdef unsigned int empty_function():
    return 0

The per-call time drops to about 40 nanoseconds. This means we can use Cython for fine-grained embedding and expect it to always execute faster than native Python.
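For reference, a Cython extension like this is typically built with a minimal setup.py along the following lines (a sketch, assuming Cython is installed; the module name comes from the .pyx file above), invoked with python setup.py build_ext --inplace:

```python
# setup.py -- minimal build script for the test.pyx extension above
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize("test.pyx"),
)
```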

With Rust, however (honestly speaking, I now prefer Rust over Cython for extension development, since there's no need for weird grammar hacks), the call time increases to about 140 nanoseconds, almost twice that of native Python. Source code as follows:

use pyo3::prelude::*;
use pyo3::wrap_pyfunction;

#[pyfunction]
fn empty_function() -> usize {
    0
}

#[pymodule]
fn testlib(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(empty_function, m)?)?;
    Ok(())
}

This suggests that Rust is not suitable for fine-grained, embedded replacement of Python code. If a task is called rarely and each call does a lot of work, Rust is a perfect fit. But if a task is called very frequently, Rust seems unsuitable, because the call and type-conversion overhead eats up most of the time saved.
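The amortisation argument can be illustrated in pure Python without any extension at all: crossing a call boundary once per element costs far more than crossing it once for a whole batch (abs and map stand in here for an arbitrary native routine; the effect, not the exact numbers, is the point):

```python
import timeit

data = list(range(1000))

# Fine-grained: abs() is called from bytecode once per element,
# paying the Python-level call overhead 1000 times per run.
fine = timeit.timeit("[abs(x) for x in data]", globals=globals(), number=2000)

# Coarse-grained: a single map() call; the per-element loop runs in C,
# so the Python-level call overhead is paid only once per run.
coarse = timeit.timeit("list(map(abs, data))", globals=globals(), number=2000)

print(f"per-element: {fine:.3f}s  batched: {coarse:.3f}s")
```

The same logic applies to a Rust extension: one call that processes a whole buffer amortises the fixed per-call cost, while a million tiny calls are dominated by it.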

I want to know whether this can be solved and, more importantly, I want to know the underlying rationale for this discrepancy. Is there some difference in how the CPython interpreter calls into the two kinds of extension, like the difference between CPython and PyPy when calling C extensions? Where can I find further information? Thanks.

===

Update:

Sorry, I didn't anticipate that my question would be ambiguous; after all, the source code for all three versions is given above, and using timeit to measure function runtimes is practically a convention in Python development.

My test code is nearly identical to @Jmb's code in the comments, with the subtle difference that I build with python setup.py build_ext --inplace instead of invoking gcc directly, but that should not make any difference. Anyway, thanks for the supplementary information.

AdamHommer
  • Are you building --release? I have found as much as a 20x improvement when moving from debug to release in rust. – Tim Roberts Mar 04 '21 at 05:09
  • @TimRoberts Yes, I built it in release mode. But this does not seem to shorten the call time – AdamHommer Mar 04 '21 at 05:14
  • In fairness, both the Python and Cython solutions are just executing "in place", whereas the Rust solution is calling into the Python interpreter in between. My guess is that you're being fooled by the overhead, which would be swamped in cases of real computation. – Tim Roberts Mar 04 '21 at 05:32
  • To add to Tim's comment: please show how you are running the Python and Cython tests. Are you calling a Python interpreter for each function call, or are the function calls done inside a loop? How are you timing things? Without full benchmark examples, it's hard to judge where the discrepancy is, but it does indeed look like there's overhead in the Rust benchmark that isn't in the other benchmarks. – 9769953 Mar 04 '21 at 09:44
  • @00 I very much doubt that he is calling a Python interpreter for each call: the extra overhead is only 70 nanoseconds per call, which is way too short for interpreter startup. That being said, yes the OP should give more details on their benchmark procedure, including timing code. – Jmb Mar 04 '21 at 11:06
  • FWIW, I've reproduced similar results, using `timeit ("empty_function()", "from test_XXX import empty_function")` for all timings. Full code [here](http://dl.free.fr/vstlpj15d). – Jmb Mar 04 '21 at 13:40
  • It might just be to do with how 0 is returned. Small integers are cached singletons in Python, and it's possible that Cython optimizes finding these singletons. Possibly not, but your example isn't necessarily as simple as it looks. – DavidW Mar 04 '21 at 15:21
  • FWIW: my results are 77 ns, 55 ns and 48 ns for Rust+PyO3, native and Cython, respectively. Rust version 1.50.0, Python 3.9.2, PyO3 0.13.2 and Cython 0.29.21. I'm using `timeit.Timer(setup="def empty(): return 0", stmt="empty()").timeit()`, and the same but with `setup="import testlib"` for Rust+PyO3 and `setup="import cytest"` for Cython. So the discrepancy is not as large, but Rust is still about 1.5 times slower. Note the small difference between the native and Cython results. – 9769953 Mar 04 '21 at 16:17
  • Fun fact: I changed the 0 to 10000 and the Cython version is now slower than native Python, and sits somewhere in between the native Python and Rust+PyO3 timings. – 9769953 Mar 04 '21 at 16:20
  • Thanks for the supplement. I've changed the return value of all three into 65536 and got: pure Python: 95 ns/call, Cython: 32 ns/call, Rust PyO3: 136 ns/call. – AdamHommer Mar 05 '21 at 02:54
  • I've compared the code generated by Cython for 0 and 125684 and they are identical except for the literal value used. – Jmb Mar 05 '21 at 07:18
  • Here's my issue on pyo3's repo https://github.com/PyO3/pyo3/issues/1470 – AdamHommer Mar 07 '21 at 10:21
  • @00 I suppose when you say "native" you're referring to "pure Python". "Native" is usually used for machine code in this kind of context, or sometimes for unmanaged code in languages like C. – dawid Mar 08 '21 at 03:16
  • @olyk Yes, my context here is Python, so pure Python if you like. – 9769953 Mar 08 '21 at 14:23
  • @LeeRoermond You may want to self-answer, repeating the result of the PyO3 issue here as an answer, with a link to it as well. That way, future searches that land on this question see an appropriate answer (time-dated, obviously), instead of having to go through the various comments to find a resolution. – 9769953 Mar 08 '21 at 14:24

2 Answers


As suggested in the comments, this is a self-answer.

Since the discussion in the comments section did not lead to a clear conclusion, I raised an issue on PyO3's repo and got a response from its main maintainer.

In short, the conclusion is that there is no fundamental difference between extensions compiled with PyO3 and with Cython as far as CPython's calling mechanism is concerned. The current speed difference comes from different depths of optimization.

Here is the link to the issue: https://github.com/PyO3/pyo3/issues/1470

AdamHommer

It's also worth noting that compiling Rust extensions with python setup.py build_ext --inplace builds them in unoptimised (debug) mode (the same goes for python setup.py develop and pip install -e .).

Here's an excerpt from the output of python setup.py build_ext --inplace:

Finished dev [unoptimized + debuginfo] target(s) in 0.02s

To build in "release" mode with an optimised binary, use:

pip install .

With pip install . --verbose you can see the difference:

Finished release [optimized] target(s) in 1.02s

This can make a massive difference: in my case, the unoptimised build is 9x slower than the optimised build.

SColvin