1

In an attempt to speed up struct.pack(), I have the following to pack an int to bytes:

import cython as c
from cython import nogil, compile, returns, locals, cfunc, pointer, address

int_bytes_buffer = c.declare(c.char[400], [0] * 400)


@locals(i = c.int, num = c.int)
@returns(c.int)
@cfunc
@nogil
@compile
def int_to_bytes(num):
    i = 0
    while num >0:
        int_bytes_buffer[i] = num%256
        num//=256
        i+=1

    return int_bytes_buffer[0]


int_to_bytes(259)

I'm trying to get this to work on a list of ints, with the following bad code:

@locals(i = c.int, ints_p = pointer(c.int[100]), num = c.int)
@returns(c.int)
@cfunc
@nogil
@compile
def int_to_bytes(num):
    i = 0
    for num in ints_p:
        while num >0:
            int_bytes_buffer[i] = num%256
            num//=256
            i+=1

    return int_bytes_buffer[0]

ints = c.declare(c.int[100],  [259]*100)
int_to_bytes(address(ints))

which gives me:

    for num in ints_p:
              ^
----------------------------------------------------------

 Accessing Python global or builtin not allowed without gil

Evidently I shouldn't be using in, or looping over a pointer.

How can I loop over the list-made-array inside the function?

EDIT:

I'm trying to pass a pointer to an array of ints to the function, and have it work without the gil so it can be parallelized.

The parameter to the function should've been ints_p:

@locals(ints_p = pointer(c.int[100]), i = c.int, num = c.int)
@returns(c.int)
@cfunc
@nogil
@compile
def int_to_bytes(ints_p):
    i = 0
    for num in (*ints_p):
        while num >0:
            int_bytes_buffer[i] = num%256
            num//=256
            i+=1

    return int_bytes_buffer[0]

ints = c.declare(c.int[100],  [259]*100)
int_to_bytes(address(ints))

and I want to run over the actual ints and pack them (without the gil)

EDIT 2:

I am aware of struct.pack. I wish to make a parallelizeable variant with cython and nogil.

Jay
  • 2,535
  • 3
  • 32
  • 44
  • If you looked on CPython sources, you would found that struct.unpack/pack functions are already implemented in C. It is unlikely that using of Cython improves its performance. – bubble Nov 25 '19 at 09:43
  • I want a nogil variant – Jay Nov 25 '19 at 10:35

2 Answers2

4

This is pointless:

  1. A Python int can be arbitrarily big. The actual computational work in "packing" it is working out if it fits in a given size and then copying it to a space of that size. However, you're using an array of C ints. These have a fixed size. There is basically no work to be done in extracting them into an array of bytes. All you have done is written a very inefficient version of memcpy. They are literally already in memory as a contiguous set of bytes - all you have to do is view them as such:

    # using Numpy (no Cython)
    ints = np.array([1,2,3,4,5,6,7], dtype=np.int) # some numpy array already initialized
    as_bytes = ints.view(dtype=np.byte) # no data is copied - wonderfully efficient
    

    you could make a similar approach work with another array library or with C arrays too:

    # slightly pointless use of pure-Python mode since this won't
    # be valid in Python.
    @cython.cfunc
    @cython.returns(cython.p_char)
    @cython.locals(x = cython.p_int)
    def cast_ptr(x):
        return cython.cast(cython.p_char,x)
    
  2. You say you want nogil so it can be parallelized. Parallelization works well when there's actual computational work to be done. It doesn't work well when the task is limited by memory access, since the threads tend to end up waiting for each other for access to the memory. This task will not parallelize well.

  3. Memory management is a problem. You're only capable of writing into fixed-size buffers. To allocate variable-sized arrays you have a number of choices: you can use numpy or the Python array module (or similar) to let Python take care of the memory-management or you can use malloc and free to allocate arrays on a C level. Since you claim to need nogil you have to use the C approach. However, you cannot do this from Cython's pure-Python mode since everything also has to work in Python and there is no Python equivalent of malloc and free. If you insist on trying to make this work then you need to abandon Cython's pure-Python mode and use the standard Cython syntax since what you are trying to do cannot be made compatible with both.

    Note that currently int_bytes_buffer is a global array. This means that multiple threads will share it - a disaster for your supposed parallelization.


You need to think clearly what your inputs are going to be. If it's a list of Python ints then you cannot make this work with nogil (since you are manipulating Python objects and this requires the GIL). If it's some C-level array (be it Numpy, the array module, or a Cython declared C array) then your data is already in the format you want and you just have to view it as such.


Edit: From the comments this is clearly an X-Y problem (you're asking about fixing this Cython syntax because you want to pack a list of ints) I've added a fast way of packing a list of Python ints using Cython. This is 7x faster than struct pack and 5x faster than passing a list to array.array. It's mostly faster because it's specialized to only do one thing.

I've used bytearray as a convenient writeable data store and the Python memoryview class (not quite the same as the Cython memoryview syntax...) as a way to cast the data-types. No real effort has been spent optimising it so you may be able to improve it. Note that the copy into bytes at the end does not change the time measurable, illustrating just how irrelevant copying the memory is to the overall speed.

@cython.boundscheck(False)
@cython.wraparound(False)
def packlist(a):
    out = bytearray(4*len(a))
    cdef int[::1] outview = memoryview(out).cast('i')
    cdef int i
    for i in range(len(a)):
        outview[i] = a[i]
    return bytes(out)
DavidW
  • 29,336
  • 6
  • 55
  • 86
  • Thanks for the answer. I wish to start with a list of Python ints, and end up with their packed form, just the way struct.pack works. I saw an improvement when using struct.pack with multiprocessing, and figured "ungilling" this operation can yield the benefit of not going through pickling / unpickling. – Jay Nov 25 '19 at 17:52
  • 1
    My view is: using Cython (probably mixed with Python C API calls) you could write a specialized version of `struct.pack` that only deals with Python integers that go to 32 bit ints, and this would probably beat `struct.pack` but it wouldn't be parallel. But I don't think the bit of the algorithm that _this_ question focuses on has much to be gained – DavidW Nov 25 '19 at 18:33
  • Thanks for the sample! Could you add your testing code? Did you use @compile? – Jay Nov 26 '19 at 15:16
  • 1
    @Jay The code as given is written in normal Cython (i.e. a .pyx file) rather than "pure-Python" mode which you're using, therefore it doesn't need a `@compile` decorator. Just put it in a .pyx file and use `cythonize -i filename.pyx` to compile it (or setup.py, or pyximport - there's lots of options). You could translate it to "pure-Python" mode pretty easily if you wanted but I don't think there's much point - it'll be painfully slow without Cython so there's no point in making it runnable in Python alone. – DavidW Nov 26 '19 at 15:37
  • 1
    It was compared against `def struct_pack(a): return struct.pack("i"*len(a), *a)` and `def make_array(a): return array.array("i",a).tobytes()`. The length of `a` doesn't really affect the relative performance too much. I know `"i"*len(a)` looks bad, but it actually doesn't slow things. I don't think I have the testing code to hand, but it was just using `timeit` – DavidW Nov 26 '19 at 15:41
  • Is it beneficial to type `a` in your example somehow, so cython knows in advance it's a list of ints? – Jay Jan 21 '20 at 18:54
  • 1
    You could certainly type it as a `list`. There's no way of specifying list of ints. I doubt it makes much difference. – DavidW Jan 21 '20 at 19:36
  • You mentioned there is room to improve performance. How may one go about improving it further? – Jay Jan 26 '20 at 21:08
  • @Jay I mentioned that I hadn't spent any effort on improving performance. That's different from saying there's room to improve it. There's nothing that looks immediately obvious to do. – DavidW Jan 27 '20 at 08:54
0

There are a few errors in your code.

  1. In the error Accessing Python global or builtin not allowed without gil, so you need to remove the tag of @nogil. After you remove that, it will not show the error. Tested in my code. But there are other errors.

  2. Your function has a few problems. def int_to_bytes(num): You shouldn't pass num in the function since the value num will be assigned in the for loop. I remove it as def int_to_bytes(): and the function works. But there's still error.

    @locals(i = c.int, ints_p = c.int(5), num = c.int)
    @returns(c.int)
    @cfunc
    @compile

    def int_to_bytes():
        ints_p = [1,2,3,4,5]
        i = 0
        for num in ints_p:
            while num >0:
                int_bytes_buffer[i] = num%256
                num//=256
                i+=1

        return int_bytes_buffer[1]

    a = int_to_bytes()
    print(a)
  1. Last, I don't understand why you pass the address to the function, since the function shouldn't take anything.

The code works for me:

import cython as c
from cython import nogil, compile, returns, locals, cfunc, pointer, address

int_bytes_buffer = c.declare(c.char[400], [0] * 400)

ints = c.declare(c.int[100],  [259]*100)
# for i in list(*address(ints)):
#   print(i)
@locals(i = c.int, num = c.int)
@returns(c.int)
@cfunc
@compile

def int_to_bytes(values):
    i = 0
    for num in list(*address(values)):
        while num >0:
            int_bytes_buffer[i] = num%256
            num//=256
            i+=1

    return int_bytes_buffer

a = int_to_bytes(ints)
print([i for i in a])

Hope it helps.

Bill Chen
  • 1,699
  • 14
  • 24
  • Hi, thanks for the answer. What I'm looking to accomplish is to pass a pointer to an int array to the function (should've passed ints_p) and have it work without the gil, so I can use it multi threaded. – Jay Nov 25 '19 at 04:46
  • I think it should be similiar, are you sure the address is continuous in the memory, if so, you should add 1, and use `address()` to extract the value. – Bill Chen Nov 25 '19 at 05:57