
I created Python bindings using pybind11. Everything worked, but when I ran a speed check the result was disappointing.

Basically, I have a C++ function that adds two numbers, and I want to call that function from a Python script. I also wrapped the call in a for loop that runs 100 times, to make the difference in processing time easier to see.

For the function imported from C++ via pybind11, I get: 0.002310514450073242 ~ 0.0034799575805664062 seconds.

For the plain Python script, I get: 0.0012788772583007812 ~ 0.0015883445739746094 seconds.

main.cpp file:

#include <pybind11/pybind11.h>
namespace py = pybind11;

double sum(double a, double b) {
    return a + b;
}

PYBIND11_MODULE(SumFunction, var) {
    var.doc() = "pybind11 example module";
    var.def("sum", &sum, "This function adds two input numbers");
}

main.py file:

from build.SumFunction import *
import time

start = time.time()
for i in range(100):
    print(sum(2.3,5.2))
end = time.time()

print(end - start)
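(Side note: the wildcard import rebinds the name sum, shadowing Python's built-in sum. Importing with an explicit alias, e.g. from build.SumFunction import sum as cpp_sum, avoids surprises later; cpp_sum is just an illustrative name.)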

CMakeLists.txt file:

cmake_minimum_required(VERSION 3.0.0)
project(Projectpybind11 VERSION 0.1.0)

include(CTest)
enable_testing()

add_subdirectory(pybind11)
pybind11_add_module(SumFunction main.cpp)

set(CPACK_PROJECT_NAME ${PROJECT_NAME})
set(CPACK_PROJECT_VERSION ${PROJECT_VERSION})
include(CPack)

Simple Python script:

import time

def summ(a, b):
    return a + b

start = time.time()
for i in range(100):
    print(summ(2.3, 5.2))
end = time.time()

print(end - start)
  • There is some overhead in calling a C++ function: the arguments need to be converted from Python to C++ and the return value needs to be converted back to Python. This overhead is only worth it if you're doing significant work in C++, not if it's just one addition. – Thomas Aug 26 '22 at 11:22
  • Did you enable optimizations when compiling the C++ code? – 463035818_is_not_an_ai Aug 26 '22 at 11:22
  • @463035818_is_not_a_number, this was the problem: I was compiling in debug mode, not release. Now I obtain values like 0.0013055801391601562, which is almost the same as pure Python. – IceCode Aug 26 '22 at 11:31
  • @Bianca You are mostly measuring the time of the `print` call, which is much slower than the time it takes to add two numbers. – Thomas Aug 26 '22 at 11:52
  • This is totally in the noise, but as others have mentioned, `print` is going to overwhelm the cost of the addition. If you have real code that's impacted, please put a minimal example here; otherwise you're trying to optimize a fake problem. – erip Aug 26 '22 at 12:03
  • If I remove the print and only keep the sum of the two numbers, the time difference between Python and pybind11 increases again. Thank you @Thomas for your answers! – IceCode Aug 26 '22 at 12:16
  • @erip, I don't have real code to test the performance; I just created this simple example to check that pybind11 works properly (that I can import the function in Python). In the future I will do some real work in C++ and try the binding again to see the results. Thank you! – IceCode Aug 26 '22 at 12:16
  • FYI, benchmarking is a complex thing. But Python comes with a nice tool, `timeit`. It allows reasonably fair time comparisons, provided you only test the relevant code. So here, **please remove the `print` from the measured code**. – Serge Ballesta Aug 27 '22 at 08:00
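Putting those comments into practice: below is a minimal sketch of the measurement, assuming the module was built as in the question and with optimizations enabled (for the CMake setup above, configure with -DCMAKE_BUILD_TYPE=Release). timeit keeps print() out of the timed code and repeats each run to reduce noise; the alias cpp_sum is only there to avoid shadowing the built-in sum:

import timeit

from build.SumFunction import sum as cpp_sum  # same layout as in the question

def py_sum(a, b):
    return a + b

# Call each function a million times per run, repeat 5 runs, keep the best.
cpp_time = min(timeit.repeat(lambda: cpp_sum(2.3, 5.2), number=1_000_000, repeat=5))
py_time = min(timeit.repeat(lambda: py_sum(2.3, 5.2), number=1_000_000, repeat=5))

print(f'pybind11: {cpp_time:.4f} s   pure Python: {py_time:.4f} s')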

1 Answer

  1. Benchmarking is a very complicated thing; it could even be called systems engineering in its own right.

    Many other processes can interfere with a benchmark: NIC interrupt handling, keyboard and mouse input, OS scheduling... I have seen one of my own processes blocked by the OS for up to 15 seconds! So, as the commenters have pointed out, print() only adds more unnecessary interference.

  2. Your test computation is too simple.

    Think through exactly what you are comparing. Passing arguments between Python and C++ is obviously slower than staying inside Python, so I assume you want to compare computing speed rather than argument-passing speed. If so, your computation is too simple: the measured time is mostly the time spent passing arguments, and the computing time is only a small part of the total. That is why I offer my sample below; I would be glad to see anyone polish it.

  3. Your loop count is too small.

    The fewer the loops, the more randomness. As in point 1, a test that runs for only 0.000x seconds can easily be disturbed by the OS. The test should run for at least a few seconds.

  4. C++ is not always faster than Python. Nowadays many Python modules/libraries can offload heavy computation to the GPU, or run matrix operations in parallel even on the CPU alone (see the sketch after this list). I guess you may be evaluating whether to use pybind11 in your project. A comparison like this proves little on its own, because the best tool depends on the actual requirement, but it is a good way to learn. I recently ran into a case where Python was faster than C++ in a deep-learning workload. Funny, isn't it?
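As a hedged illustration of point 4 (the array size and the operation are chosen arbitrarily), a single vectorized NumPy call performs the same reduction as a plain Python loop but runs in optimized C:

import time
import numpy as np

xs = np.random.random(10_000_000)

start = time.time()
total = 0.0
for v in xs:  # plain Python loop over the array
    total += v
print(f'Python loop: {time.time() - start:.3f} s, total={total}')

start = time.time()
total = xs.sum()  # one vectorized call, executed in optimized C
print(f'numpy sum:   {time.time() - start:.3f} s, total={total}')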

In the end, I ran my sample on my PC and found the C++ computation up to 100 times faster than the Python version.

ComplexCpp.cpp:

#include <cmath>
#include <pybind11/numpy.h>
#include <pybind11/pybind11.h>

namespace py = pybind11;

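// Enough per-element arithmetic that the computation itself, not the
// Python/C++ argument passing, dominates the measured time.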
double Compute( double x, py::array_t<double> ys ) {
//  std::cout << "x:" << std::setprecision( 16 ) << x << std::endl;
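    // unchecked<1>() gives a raw 1-D view of the array without bounds checking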
    auto r = ys.unchecked<1>();
    for( py::ssize_t i = 0; i < r.shape( 0 ); ++i ) {
        double y = r( i );
//      std::cout << "y:" << std::setprecision( 16 ) << y << std::endl;
        x += y;
        x *= y;
        y = std::max( y, 1.001 );
        x /= y;
        x *= std::log( y );
    }
    return x;
}

PYBIND11_MODULE( ComplexCpp, m ) {
    m.def( "Compute", &Compute, "a more complicated computing" );
}
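(To build this module, assuming the same CMake setup as in the question: add pybind11_add_module(ComplexCpp ComplexCpp.cpp) to the CMakeLists and configure with -DCMAKE_BUILD_TYPE=Release, so the C++ side is compared with optimizations on.)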

tryComplexCpp.py:

import ComplexCpp
import math
import numpy as np
import random
import time


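# Same per-element arithmetic as the C++ Compute above, so both
# sides do exactly the same work.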
def PyCompute(x: float, ys: np.ndarray) -> float:
    #print(f'x:{x}')
    for y in ys:
        #print(f'y:{y}')
        x += y
        x *= y
        y = max(y, 1.001)
        x /= y
        x *= math.log(y)
    return x


LOOPS: int = 100_000_000

if __name__ == "__main__":
    # initialize random
    x0 = random.random()

    """ We store all the args in one array and pass it to both the C++ function
        and the Python function, to ensure both sides get exactly the same args. """
    args = np.empty(LOOPS, dtype=np.float64)  # uninitialized; filled below
    for i in range(LOOPS):
        args[i] = random.random()

    print('Args are ready, now start...')

    # try it with C++
    start_time = time.time()
    x = ComplexCpp.Compute(x0, args)
    print(f'Computing with C++ in { time.time() - start_time }.\n')
    # use the result so the whole computation cannot be optimized away
    print(f'The result is {x}\n')

    # try it with python
    start_time = time.time()
    x = PyCompute(x0, args)
    print(f'Computing with Python in { time.time() - start_time }.\n')
    # use the result so the whole computation cannot be optimized away
    print(f'The result is {x}\n')
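A side note on the setup cost: filling 10^8 elements one at a time in a Python loop is itself slow. If NumPy's own generator is acceptable for this test, the argument array can be produced in a single vectorized call:

args = np.random.random(LOOPS)  # one call instead of a 10^8-iteration Python loop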
  • Thank you very much for this answer! I was not aware of all the interference that can take place when measuring the processing time of my code. Yes, I created this simple example because I wanted to test pybind11 and decide whether it is worth integrating into my project. I will give it a chance and test it with proper code. Thank you again! – IceCode Aug 28 '22 at 10:35
  • I have had pybind11 in my project for about half a year. My feeling is that it is much more convenient than the raw CPython API. As for performance, I think it would NOT be the bottleneck: if the design is good, the work is done either in Python or in C++. – Leon Aug 28 '22 at 10:55