
I have a time-critical model that I wrote in Cython. The main function of my Cython extension has one loop, and according to the Cython annotation output (which shades lines yellow according to how much Python interaction they involve), the only 'yellow' part of that loop is currently where I append to a Python list. (I have to output a Python object, since I call my Cython function from a Python script.) This is the basic idea of my function (the rest is superfluous; I've tested every part of it, and the append operation is the bottleneck):

from libc.math cimport log
def main(some args):
    cdef (some vars)

    cdef list OutputList = []

    # NB: all vars have declared types
    for x in range(t):
        (do some Cythonic stuff, some of which uses my cimport-ed log)
        if condition:
            OutputList.append(x) # this is the only 'yellow' line in my main loop.
    return OutputList # return Python object to Python script that calls main()

Unfortunately, I don't know the length of my output array/list/vector (whatever I end up using) in advance. However, I could set it to 52560, which is what I end up resizing it to later in some other Python code. I'd like to get a major speed boost without fixing the output's length up front, but I'll gladly toss that hope if it's holding me back.

I've also tried going with C++ in Cython so I can use C++ data structures (vector, queue, etc.), but doing so breaks my ability to simply cimport log from libc.math. The Cython documentation/wiki says you can write a 'shim' module to use pure-C functions from C++-mode Cython, but I have no idea how to do this and can't find an example of how to go about it.
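
For what it's worth, here is a minimal sketch of the declaration-based workaround rather than a full shim module; it assumes math.h on the platform carries the usual extern "C" guards, so the function is callable from C++-compiled code:

# hedged sketch: declare log straight from math.h; on most platforms the
# header already wraps its declarations in extern "C", so this also
# compiles when the module is built in C++ mode
cdef extern from "math.h":
    double log(double x) nogil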

Anyway, I welcome all suggestions that address my question:

What is the best way to build a list/array/vector of unknown size in Cython? Or is there a clear alternative (such as settling for a known-length iterable object) that makes my unknown-length problem moot?

Update

The C++ containers showed a speed increase over item assignment, and item assignment showed a speed increase over appending to lists and numpy arrays. The best method would be to use C++ containers while still being able to cimport pure-C functions; this would avoid the slowdown of having to look beyond libc.math for a fast log function.

– mdscruggs

3 Answers


build1darray.pyx:

#cython: boundscheck=False, wraparound=False
from libc.math cimport log

from cython.parallel cimport prange

import numpy as pynp
cimport numpy as np

# copy declarations from libcpp.vector to allow nogil
cdef extern from "<vector>" namespace "std":
    cdef cppclass vector[T]:
        void push_back(T&) nogil
        size_t size() nogil
        T& operator[](size_t) nogil

def makearray(int t):
    cdef vector[np.float_t] v
    cdef int i
    with nogil: 
        for i in range(t):
            if i % 10 == 0:
                v.push_back(log(i+1))

    cdef np.ndarray[np.float_t] a = pynp.empty(v.size(), dtype=pynp.float64)
    for i in prange(a.shape[0], nogil=True):
        a[i] = v[i]
    return a

The second part takes about 1% of the time of the first loop, so it doesn't make sense to optimize it for speed in this case.

<math.h> has extern "C" { ... } on my system, so libc.math.log works even in C++ mode.

PyArray_SimpleNewFromData() could be used to avoid copying the data, at the cost of managing the array's memory yourself.
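
For completeness, a possible setup.py for building this module is sketched below: the C++ language setting is needed for the std::vector declarations, the numpy include path for cimport numpy, and the OpenMP flags (shown for GCC/Clang) for prange. The names and flags here are illustrative, not taken from the original answer.

# a possible build script for build1darray.pyx (illustrative)
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy

setup(ext_modules=cythonize([
    Extension(
        "build1darray",
        sources=["build1darray.pyx"],
        language="c++",                      # needed for the std::vector declarations
        include_dirs=[numpy.get_include()],  # needed for "cimport numpy"
        extra_compile_args=["-fopenmp"],     # needed by prange on GCC/Clang
        extra_link_args=["-fopenmp"],
    )
]))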

– jfs

Appending to Python lists is a well-optimized operation in CPython. Python does not allocate memory for each element; it grows the array of pointers to the list's objects incrementally. So just switching to Cython will not help you very much here.
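
You can see the over-allocation from plain Python; a small illustrative snippet (the exact sizes depend on the interpreter):

# watch the list's allocated size jump only occasionally as it grows
import sys

lst = []
last = sys.getsizeof(lst)
for i in range(32):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last:
        print("len=%2d  size=%d bytes" % (len(lst), size))
        last = size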

You can use C++ containers within Cython as follows:

from libc.math cimport log
from libcpp.list cimport list as cpplist

def main(int t):

    cdef cpplist[int] temp
    cdef int x, i

    for x in range(t):
        if x > 0:
            temp.push_back(x)

    cdef int N = temp.size()
    cdef list OutputList = N * [0]

    for i in range(N):
        OutputList[i] = temp.front()
        temp.pop_front()

    return OutputList

You will have to test whether this speeds things up, but you may not gain much.
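
As a rough way to test it, something like the following could be used (the module name listbuild is only a placeholder for however the compiled extension is named):

# crude timing sketch; argument and repeat count are arbitrary
import timeit

print(timeit.timeit("listbuild.main(52560)",
                    setup="import listbuild", number=100))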

Another way is to use numpy arrays. Cython is very good at optimizing such code. So if you can live with a numpy array as the return value of main, you should consider that, and replace the construction and filling of OutputList with Cython code that allocates and fills a numpy array.

For more information see http://docs.cython.org/src/tutorial/numpy.html
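
A minimal sketch of that idea, assuming the 52560 mentioned in the question is a safe upper bound (the condition and dtype below are placeholders):

# sketch: allocate once at a known upper bound, fill, then trim with a slice
#cython: boundscheck=False, wraparound=False
import numpy as np
cimport numpy as cnp

def main(int t, int upper_bound=52560):
    cdef cnp.ndarray[cnp.int64_t] out = np.empty(upper_bound, dtype=np.int64)
    cdef int x, n = 0
    for x in range(t):
        if x % 10 == 0:        # stand-in for the real condition
            out[n] = x
            n += 1
    return out[:n]             # view of the filled part; use .copy() if needed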

Ask if you need help.

UPDATE: the code should be a bit faster if you avoid method lookup in both loops:

from libc.math cimport log
from libcpp.list cimport list as cpplist

def main(int t):

    cdef cpplist[int] temp

    push_back = temp.push_back
    for x in range(t):
        if x > 0:
            push_back(x)

    cdef int N = temp.size()
    cdef list OutputList = N*[0]

    front = temp.front
    pop_front = temp.pop_front
    for i in range(N):
        OutputList[i] = front()
        pop_front()

    return OutputList  
– rocksportrocker
  • My post mentioned that I tried C++ containers. I gained time on the output-building part but lost more time than I gained because of Cython's issue with using pure-C functions when in C++ mode, which killed my ability to use the cimport-ed log from libc.math. I have tried using numpy arrays to build the output, but perhaps my methods are inefficient. I ought to re-test a few methods for allocating to a numpy array (I've tried outlist = np.append(outlist, newvalue), for example, and it seemed very slow, even with declared types and the buffer interface). – mdscruggs Sep 13 '11 at 17:30
  • Using np.append should be slow. Either you collect your data in some container, or you count your elements first, then allocate an appropriately sized numpy array and then fill it. The last step should be very fast. – rocksportrocker Sep 13 '11 at 17:47
  • @mdscruggs: please report which method gives you the most speedup. – rocksportrocker Sep 16 '11 at 12:53
  • `push_back = temp.push_back` would be useless even if it compiled. `.push_back()` is not a virtual method; the address is known at compile time. – jfs Sep 23 '11 at 12:10
  • @JF: It compiles, but you are right. If you use C++ container classes there is no method lookup in the compiled code anyway. The situation is different if temp were a Python object; in that case the generated code would do a method lookup at run time. – rocksportrocker Sep 23 '11 at 15:28
  • @rocksportrocker: gcc says `argument of type 'void (std::vector<int, std::allocator<int> >::)(const std::vector<int, std::allocator<int> >::value_type&)' does not match 'void (*)(int&)'`. – jfs Sep 23 '11 at 16:18

What you can do is count how many elements meet your criteria and then allocate a numpy array that is big enough for exactly those elements.

# pseudo code: two passes -- count first, then allocate and fill
import numpy as np
cimport numpy as cnp

def main(int t):
    cdef int i, count = 0
    for i in range(t):
        if criteria:
            count += 1

    # allocate exactly the right size once the count is known
    cdef cnp.ndarray[cnp.float64_t] result = np.empty(count, dtype=np.float64)

    cdef int idx = 0
    for i in range(t):
        if criteria:
            result[idx] = value
            idx += 1

    return result
– fabrizioM
  • Unfortunately, I'm generating new data using a non-linear model simulation, so I have no way of knowing ahead of time how long the output array will be. – mdscruggs Sep 13 '11 at 18:14
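
When the final length genuinely cannot be known ahead of time, a related pattern (not from the answers above, just a sketch) is geometric over-allocation: start with a guess, double the buffer whenever it fills, and trim once at the end, so the total copying cost stays amortized. The condition, dtype, and initial size below are illustrative:

# sketch: grow a numpy buffer geometrically, then trim to the filled length
import numpy as np
cimport numpy as cnp

def main(int t):
    cdef cnp.ndarray[cnp.float64_t] buf = np.empty(1024, dtype=np.float64)
    cdef int x, n = 0
    for x in range(t):
        if x % 10 == 0:                  # stand-in for the real condition
            if n == buf.shape[0]:
                # double the buffer and carry over the filled part
                bigger = np.empty(2 * buf.shape[0], dtype=np.float64)
                bigger[:n] = buf
                buf = bigger
            buf[n] = x
            n += 1
    return buf[:n]                       # trim to the part actually filled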