
I have a function in an inner loop that takes two arrays and combines them. To get a feel for what it's doing, look at this example using lists:

a = [[1,2,3]]
b = [[4,5,6],
     [7,8,9]]

def combinearrays(a, b):
    a = a + b
    return a


def main():
    print(combinearrays(a,b))

The output would be:

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

The key thing here is that I always have the same number of columns, but I want to append rows together. Also, the values are always ints.

As an added wrinkle, I cheated and created `a` as a list within a list. But in reality, it might be a single-dimensional array that I still want to combine into a 2D array.

I am currently doing this using Numpy in real life (i.e. not the toy problem above) and it works. But I really want to make this as fast as possible, and it seems like C arrays should be faster. Obviously one problem with C arrays, if I pass them as parameters, is that I won't know the actual number of rows in the arrays passed. But I can always add additional parameters to pass that.

So it's not like I don't have a solution to this problem using Numpy, but I really want to know the single fastest way to do this in Cython. Since this is a call inside an inner loop, it's going to get called thousands of times, so every little saving counts.

One obvious idea here would be to use malloc or something like that.

Bruce Nielson
    Make up your mind, are you working with Python lists, numpy arrays, or C arrays. Conversion between them isn't cheap. – hpaulj Jul 28 '19 at 16:09
    My suspicion is you should use a list of numpy arrays (append to list is usually pretty quick) then right at the end call numpy `hstack`/`vstack` (I can never pick the right one first time...) – DavidW Jul 28 '19 at 16:38
  • @hpaulj That's really what my question is: which is the fastest for an inner loop so that I can make up my mind to work with it. It wouldn't make sense to commit to one first without trying each out. For the sake of argument, let's say I'm going with C arrays as I'm struggling to understand the best way to combine those. – Bruce Nielson Jul 28 '19 at 17:13
  • @DavidW oh yes, I forgot about vstack. That is actually a pretty good idea if I stick with numpy arrays inside a list (which is one of the ways I tried implementing it). – Bruce Nielson Jul 28 '19 at 17:14
  • One thing I've noticed is that I expected numpy to be faster than lists, but it didn't seem to be the case. That still surprises me a little. Even just passing the arrays in as numpy arrays rather than lists (then converting) was a bit slower. I tried switching to memviews, but they do no good in a situation like this (where I'm primarily just trying to combine arrays together.) – Bruce Nielson Jul 28 '19 at 17:15
    Often people ask how to create an array iteratively. They imagine themselves creating an array row by row (read from a file or calculated). Usually we say collect them in a list and do one concatenate. Or initialize an (n, m) `zeros` array and assign row by row (see the sketch after these comments). Stay away from a row-by-row concatenate. Whether you can do better in `cython` depends, in part at least, on how you generate the rows. Even there I suspect the best thing is to initialize the large blank array/memory view and assign values. But it's even better if you create the array with a few whole-array actions, and no iteration. – hpaulj Jul 28 '19 at 18:13
  • @hpaulj that's good advice. +1. So next question. What's the best way to dynamically create an array then to follow your advice? Since I won't know size until I do len(list) so I can't do it as a declaration near the top. – Bruce Nielson Jul 28 '19 at 23:23
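
A minimal sketch of the pre-allocation idea from hpaulj's comment, assuming the row count is known up front (the names `num_rows`, `row_length`, and `compute_row` are illustrative, not from the question):

import numpy as np

def build_preallocated(num_rows, row_length):
    # Allocate the full result once, then fill it row by row.
    out = np.zeros((num_rows, row_length), dtype=int)
    for i in range(num_rows):
        out[i, :] = compute_row(i)  # hypothetical per-row calculation
    return out

If the row count is only known at the end (as in the follow-up comment), the list-then-stack approach in the answer below is the usual fallback.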

1 Answer


While I'm not convinced this is the only option, let me recommend the simple approach of building a standard Python list with append and then using np.vstack or np.concatenate at the end to build a full Numpy array.

Numpy arrays store all the data essentially contiguously in memory (this isn't 100% true if you're taking slices, but for freshly allocated memory it's basically true). When you resize the array it may get lucky, finding unallocated memory after the array, and be able to reallocate in place. In general, though, this won't happen and the entire contents of the array will need to be copied to the new location. (The same will likely apply to any solution you devise yourself with malloc/realloc.)

Python lists are good for two reasons:

  1. They are internally an array of PyObject* (in this case, pointers to the Numpy arrays the list contains). If copying is needed during a resize, you are only copying the pointers to the arrays, not the whole arrays.
  2. They are designed to handle resizing/appending intelligently by over-allocating beyond the space immediately needed, so that they only need to re-allocate memory occasionally. Numpy arrays could have this feature, but it's less obviously a good thing for Numpy than it is for Python lists (if you have a 10 GB data array that barely fits in memory, do you really want it over-allocated?).

My proposed solution uses the flexible, easily-resized list class to build your array, and then only converts to the inflexible but faster Numpy array at the end, thereby (largely) getting the best of both.
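
As a hedged sketch of what that looks like (the names `get_row` and `num_rows` are illustrative, not from the question):

import numpy as np

def combine_rows(num_rows):
    rows = []
    for _ in range(num_rows):
        rows.append(get_row())  # hypothetical: returns one row as a list or 1D array
    # One copy at the end; np.vstack also promotes 1D rows to shape (1, n),
    # which covers the question's "a might be one-dimensional" wrinkle.
    return np.vstack(rows)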


A completely untested outline of the same approach, using C to allocate the rows, would look like:

import numpy as np
from libc.stdlib cimport malloc, free, realloc


def build_with_c_arrays(int num_rows, int row_length):
    cdef int** ptr_array = NULL
    cdef int* current_row = NULL
    cdef int row, n

    # memoryview, just to be able to return a numpy array
    cdef int[:, ::1] out

    rows_allocated = 0
    try:
        for row in range(num_rows):
            # grow the array of row pointers by one each iteration
            ptr_array = <int**>realloc(ptr_array, sizeof(int*) * (row + 1))

            current_row = ptr_array[row] = <int*>malloc(sizeof(int) * row_length)
            rows_allocated = row + 1
            # fill in data on current_row

        # pass to numpy so we can access it in Python. There are other
        # ways of transferring the data to Python...
        out = np.empty((rows_allocated, row_length), dtype=np.intc)
        for row in range(rows_allocated):
            for n in range(row_length):
                out[row, n] = ptr_array[row][n]

        return out.base
    finally:
        # clean up memory we have allocated
        for row in range(rows_allocated):
            free(ptr_array[row])
        free(ptr_array)

This is unoptimized; a better version would over-allocate ptr_array (for example by doubling its capacity) rather than calling realloc for every row. Because of this I don't actually expect it to be quick, but it's meant as an indication of how to start.
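
As a rough, untested sketch of that over-allocation idea (same caveats as the outline above; the caller is responsible for freeing the rows and the pointer array):

from libc.stdlib cimport malloc, realloc

cdef int** grow_rows(int num_rows, int row_length):
    # Sketch only: grow the array of row pointers with a doubling capacity,
    # so realloc runs O(log n) times instead of once per row.
    cdef int** ptr_array = NULL
    cdef int capacity = 0
    cdef int row
    for row in range(num_rows):
        if row >= capacity:
            capacity = capacity * 2 if capacity > 0 else 8
            ptr_array = <int**>realloc(ptr_array, sizeof(int*) * capacity)
        ptr_array[row] = <int*>malloc(sizeof(int) * row_length)
        # fill ptr_array[row] here, as in the outline above
    return ptr_array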

DavidW
    I'm highly recommending you __don't__ accept this answer! I doubt if it's the last word on the issue, but I think it's a good (and simple) starting point. – DavidW Jul 28 '19 at 17:30
  • I'm going to try this out and actually see how it affects the overall speed. I'd still be very interested in a C array solution so that I can compare. How would one go about doing this with C arrays? Does it require malloc? Or can it be done in some other way? (i.e. are they any built in functions or operations that allow me to combine C arrays together?) – Bruce Nielson Jul 28 '19 at 19:04
  • I think I'm hoping to avoid malloc so that I don't have to do manual garbage collection. :) – Bruce Nielson Jul 28 '19 at 19:10
  • Okay, I tried passing a list of numpy arrays and here is what I found. This improves speed slightly when I'm creating the arrays, but then when I evaluate them later, it's slower. However, I also tried just creating a list of lists (only using numpy in the middle to allow use of vstack) and this slows down the creation of the array, but speeds up the evaluation of it later. I haven't been able to find a good replacement for vstack so I can just work with lists. – Bruce Nielson Jul 28 '19 at 21:45
  • Okay, I have successfully eliminated numpy all together and it's 4x faster now, though still slower than I was hoping for. (Need to get it more like 10-20x faster). Are C arrays still a possibility? – Bruce Nielson Jul 28 '19 at 23:04
  • Oh, and it's faster with lists than numpy by 4x even though I had to add a new inner loop to make it work. Numpy seems to be very slow. – Bruce Nielson Jul 28 '19 at 23:24
    @BruceNielson I've edited in an untested and unoptimized C version. It's impossible to avoid malloc/free in the C version. I doubt it'll give you the speed up you want. – DavidW Jul 29 '19 at 08:28
  • what does out.data do? I'm not familiar with that. – Bruce Nielson Jul 31 '19 at 19:55
    It's wrong - it should be `out.base` instead (I've edited now). The idea is to use the Cython memoryview for quick element-by-element access, but return the underlying Numpy array since this is often more useful to the caller. The `base` attribute gets whatever the underlying "viewed" object is. – DavidW Jul 31 '19 at 20:01
  • Btw, after some testing I found lists to be quite a bit faster than Numpy even when using memviews (at least for my limited purposes.) I don't know why this is, but I'm sticking with lists for now. I agree that the code you showed above is probably slower than I need anyhow. I might still give it a try some time for comparison purposes. – Bruce Nielson Jul 31 '19 at 20:12