Going faster than struct.pack with cython

Question

I'm trying to do better than struct.pack.

Taking a specific case of packing integeres, via the answer to this question, I have the following to pack a list of ints in pack_ints.pyx:

# cython: language_level=3, boundscheck=False
import cython

@cython.boundscheck(False)
@cython.wraparound(False)
def pack_ints(int_col):

    int_buf = bytearray(4*len(int_col))
    cdef int[::1] buf_view = memoryview(int_buf).cast('i')

    idx: int = 0
    for idx in range(len(int_col)):
        buf_view[idx] = int_col[idx]


    return int_buf

With this test code in ipython:

from struct import pack 
import pyximport; pyximport.install(language_level=3) 
import pack_ints 

amount = 10**7 
ints = list(range(amount)) 

res1 = pack(f'{amount}i', *ints) 
res2 = pack_ints.pack_ints(ints) 
assert(res1 == res2) 

%timeit pack(f'{amount}i', *ints)  
%timeit pack_ints.pack_ints(ints)

I get:

304 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
212 ms ± 6.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I tried to type int_buf as an array('b'), but didn't see an improvement.

Is there any other way to improve upon this, or use cython in a different way, to make this operation faster?

I doubt you can get much faster as this. Why do you think there is a potential to perform operation faster? — ead, Jan 29 '20 at 10:31
For example, I'm guessing that if someone gets a pointer to the list object via some C code, he can open threads and work in parallel, getting faster in accordance with the amount of cores. I don't know how I do this in cython, can't get to `@nogil` on this function — Jay, Jan 29 '20 at 13:45
I think you need gil to convert a python-object to a c-int (for example it could raise an exception). — ead, Jan 29 '20 at 13:54
I'd suggest a more useful thing for you to do is to fix the input data. Could you generate the packed data directly rather than wasting time generating a large list of Python objects then converting it? — DavidW, Jan 29 '20 at 16:41
This is a DB-API implementation. I actually made an option to accept numpy arrays, in which case you just have `.tobytes()`, but the normal mode is via using rows of python lists / tuples — Jay, Jan 29 '20 at 17:27
I don't want to try to do this myself but here are some hints as to how you might try what you want: you can [wrap existing extension types with `cdef extern`](https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#external-extension-types); you can find the definitions for [`PyListObject`](https://github.com/python/cpython/blob/cd7db76a636c218b2d81d3526eb435cfae61f212/Include/listobject.h#L23) and [`PyLongObject`](https://github.com/python/cpython/blob/e42b705188271da108de42b55d9344642170aa2b/Include/longintrepr.h#L85)... — DavidW, Jan 29 '20 at 20:12
... and the [implementation of `PyLong_AsLong`](https://github.com/python/cpython/blob/5428f48b6308c7fd71636077f2ebc307c9a53d03/Objects/longobject.c#L565). You'll need to adapt the implementation so that it doesn't use the GIL (i.e. can't raise exceptions). — DavidW, Jan 29 '20 at 20:15
In my testing the best way to speed this up is to use `arrays('i', ints).tobytes()`: packing 1e7 random 4-byte signed integers, struct.pack runs in 585 ms +- 10 ms while array runs in 357 ms +- 7 ms on my machine. — Masklinn, Jan 31 '20 at 10:37

ead · Accepted Answer · 2020-02-01T20:07:58.390

This answer tries to give an estimation, how much speed-up a parallelized version can yield. However, because this task is memory-bandwidth bound (Python-integer-objects take at least 32 bytes and can be scattered in memory, so there will be many cache misses) we should not expect much.

The first problem is, how to handle errors (element is not an integer or the value is too large). I will be following strategy/simplification: when object

isn't an integer,
is negative integer,
or integer is >=2^30

it will be casted to a special number (-1) which signals that something went wrong. Allowing only non-negative integers <2^30 makes my life easier, as I have to reimplement PyLong_AsLongAndOverflow whitout the raising errors and otherwise detecting overflows is often cumbersome (however, see the version at the end of the answer for a more sophisticated approach).

The memory-layout of Python's integer object can be found here:

struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
};

Member ob_size/macro Py_SIZE tells us how many 30-bit digits are used in the representation of the integer(ob_size is negative for negative integer).

My simple rule thus translates to the following C-code (I use rather C than Cython, as it is a simpler/more natural way of using Python's C-API):

#include <Python.h>

// returns -1 if vv is not an integer,
//            negative, or > 2**30-1
int to_int(PyObject *vv){ 
   if (PyLong_Check(vv)) {
       PyLongObject * v = (PyLongObject *)vv;
       Py_ssize_t i = Py_SIZE(v);
       if(i==0){
           return 0;
       }
       if(i==1){//small enought for a digit
           return v->ob_digit[0];
       }
       //negative (i<0) or too big (i>1)
       return -1;
   }
   return -1;
}

Now given a list, we can convert it to an int-buffer in parallel with the following C-function, which uses omp:

void convert_list(PyListObject *lst, int *output){
    Py_ssize_t n = Py_SIZE(lst);
    PyObject **data = lst->ob_item;
    #pragma omp parallel for
    for(Py_ssize_t i=0; i<n; ++i){
        output[i] = to_int(data[i]);
    }
}

There is not much to say - PyListObject-API is used to access the elements of the list in parallel. It can be done, because there are no ref counting/racing conditions in to_int-function.

Now, bundling it all together with Cython:

%%cython -c=-fopenmp --link-args=-fopenmp
import cython

cdef extern from *:
    """
    #include <Python.h>

    int to_int(PyObject *vv){ 
       ... code
    }

    void convert_list(PyListObject *lst, int *output){
        ... code
    }
    """
    void convert_list(list lst, int *output)

@cython.boundscheck(False)
@cython.wraparound(False)
def pack_ints_ead(list int_col):
    cdef char[::1] int_buf = bytearray(4*len(int_col))
    convert_list(int_col, <int*>(&int_buf[0]))
    return int_buf.base

One important detail is: convert_list must not be nogil (because it isn't)! Omp threads and Python-threads (which are affected by GIL) are completly different things.

One can (but there is no must) release GIL for omp-operations while using objects with buffer-protocol - because those objects get locked via buffer-protocol and cannot be changed from different Python-threads. A list has no such locking mechanism and thus, if GIL were released, the list could be changed in another threads and all our pointers could get invalidated.

So now to timings (with a slightly bigger list):

amount = 5*10**7 
ints = list(range(amount)) 


%timeit pack(f'{amount}i', *ints)  
# 1.51 s ± 38.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pack_ints_DavidW(ints) 
# 284 ms ± 3.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pack_ints_ead(ints) 
# 177 ms ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

btw turning off the parallelization for pack_ints_ead leads to running time of 209 ms.

So given the modest improvement of ca. 33%, I would opt for the more robust DavidW's solution.

Here is implementation with a slightly different way of signaling wrong values:

not an integer object results in -2147483648(i.e. 0x80000000) - the smallest negative value a 32bit-int can store.
integers >=2147483647 (i.e. >=0x7fffffff) will be mapped to/stored as 2147483647 - the biggest positive number a 32bit-int can store.
integers <=-2147483647 (i.e. <=0x80000001) will be mapped to/stored as -2147483647
all other integer are mapped on their correct value.

The main advantage is, that it works correctly for a larger range of integer-values. This algorithm yields almost the same running time (maybe 2-3% slower) as the first, simple version:

int to_int(PyObject *vv){ 
   if (PyLong_Check(vv)) {
       PyLongObject * v = (PyLongObject *)vv;
       Py_ssize_t i = Py_SIZE(v);
       int sign = i<0 ? -1 : 1;
       i = abs(i);
       if(i==0){
           return 0;
       }
       if(i==1){//small enought for a digit
           return sign*v->ob_digit[0];
       }
       if(i==2 && (v->ob_digit[1]>>1)==0){
           int add = (v->ob_digit[1]&1) << 30;
           return sign*(v->ob_digit[0]+add);
       }
       return sign * 0x7fffffff;
   }
   return 0x80000000;
}

Thanks! Does this C code get compiled by cython? Also, if I convert the list to a tuple with `tuple()`, can I then use `nogil`? — Jay, Feb 01 '20 at 10:50
@Jay The C code is incorporated into the c file Cython generates and compiled at the same time. Converting to `tuple` makes no difference to `nogil`, but `nogil` doesn't matter here - the loop is run in parallel like you want (none of the threads need the GIL since they are only reading and not changing, but one of the threads should have the GIL so nothing else can run in parallel and change things - I agree it's a little confusing...) — DavidW, Feb 01 '20 at 11:08
@Jay I use https://cython.readthedocs.io/en/latest/src/userguide/external_C_code.html#including-verbatim-c-code to include C-code verbatim. You can use nogil for objects which implement buffer-protocol (https://docs.python.org/3/c-api/buffer.html) neither tuple nor list do. However, you don't need to as DavidW has pointed out - the calculation are already happinging in the parallel. — ead, Feb 01 '20 at 20:05

DavidW · Answer 2 · 2020-01-31T14:06:51.043

2

When I run my code from the original question I get a ~5 times speed-up.

When I run your code here I see the results you report plus an important warning at the compile stage that I think you're ignoring:

warning: pack_ints.pyx:13:17: Index should be typed for more efficient access

I'm not sure why it isn't picking up the type correctly but to fix it you should change the definition of i back to the code I originally wrote:

cdef int i
# not "i: int"

Hopefully someone else will come along and try something cleverer, because it's obvious a bit ridiculous that this is an answer.

edited Jan 31 '20 at 14:06

answered Jan 31 '20 at 13:55

DavidW

29,336
6
55
86

I believe this line was white when I checked the annotation, so I didn't concern myself with the warning. I will definitely swap to cdef if it makes the leap to X5 – Jay Feb 01 '20 at 10:52
1

@Jay Changing that line makes the loop yellow in the annotated html. The line itself stays white – DavidW Feb 01 '20 at 11:12

Going faster than struct.pack with cython

2 Answers2

Linked