Consider a Cython extension class that takes some input and then runs a method to build out a 2D array. One of the final solution's dimensions is known at instantiation (based on user input); however, the other dimension depends on various factors that are not known ahead of time. I am looking for the fastest way to grow a solution array.
This is a similar question to the ones discussed here and here. I put some of the suggestions discussed in those posts to the test below. I want to see if the community has any further thoughts on this topic, particularly with Cython 3.0.0's release - are there any more recent tricks that can be employed?
Appending ndarrays to Python Lists
Example code:
import numpy as np

cdef class ListOfNdarrays:

    cdef Py_ssize_t y_size, index
    cdef list storage
    cdef const double[::1] y0_view
    cdef public double[:, ::1] solution

    def __init__(self, const double[::1] y0):
        self.y0_view = y0
        self.y_size = self.y0_view.size
        self.storage = list()
        self.run()

    cdef run(self):
        cdef Py_ssize_t i, j
        cdef double[::1] y_view
        y = np.empty(self.y_size, dtype=np.float64, order='C')
        y_view = y

        while True:
            # For testing purposes, let the loop run for a set number of steps.
            # In reality, _we won't know how many steps will be required_.
            # This 10_000 value would change based on the calculations performed below.
            if self.index > 10_000:
                break

            for i in range(self.y_size):
                # Perform some complex calculations to find y...
                y_view[i] = self.y0_view[i] + (<double>self.index / (<double>i + 1.))

            # Make a copy of the y array for this step. Store it in the list.
            self.storage.append(y_view.copy())
            self.index += 1

        # Save the results in a more user-friendly format
        solution_array = np.empty((self.y_size, self.index), dtype=np.float64, order='C')
        self.solution = solution_array
        for i in range(self.y_size):
            for j in range(self.index):
                self.solution[i, j] = self.storage[j][i]
If I run the above I find...
# Create a y0 array with 10 random values (y_size=10)
y0 = np.random.random_sample(10)
%timeit ListOfNdarrays(y0)
# Times on the order of 25ms
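As an aside, a variation on this approach that I did not time separately is to store full ndarray copies in the list and let numpy do the final packing in one call. A minimal plain-numpy sketch of the idea (placeholder names and sizes, not the class above):

import numpy as np

# Sketch of the list-then-pack idea (hypothetical, not part of the timings above)
storage = []
y = np.zeros(10)
for step in range(5):
    y[:] = step               # stand-in for the per-step calculation
    storage.append(y.copy())  # full copy, since y is reused every step

# Pack the list into a (y_size, n_steps) C-ordered array in one call
solution = np.stack(storage, axis=1)
assert solution.shape == (10, 5)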
Use a Prebuilt np.ndarray
Instead of appending to a Python list, create a numpy array with a fixed size that is based on a user-provided (or algorithmically determined) guess(*). If the guess turns out to be wrong, resize the array as needed.
Relevant code changes from above:
    ... # Other class attributes
    cdef Py_ssize_t expected_size, concats
    cdef double[:, ::1] storage

    def __init__(self, const double[::1] y0, Py_ssize_t expected_size):
        # Set up a storage array based on a user-defined guess of the final size.
        self.concats = 1
        self.expected_size = expected_size
        storage_arr = np.empty((self.y_size, self.expected_size), dtype=np.float64, order='C')
        self.storage = storage_arr
        ... # Other init steps

    cdef run(self):
        cdef double[:, ::1] new_storage_view
        while True:
            ... # Run the loop as normal. At the end of the loop, check if there is enough room.
            if self.index >= (self.concats * self.expected_size):
                self.concats += 1
                new_storage = np.empty((self.y_size, self.expected_size * self.concats),
                                       dtype=np.float64, order='C')
                new_storage_view = new_storage
                # Populate the old values
                for i in range(self.y_size):
                    for j in range(self.index):
                        new_storage_view[i, j] = self.storage[i, j]
                # Reassign the variable
                self.storage = new_storage_view
            # Add this step's results to storage
            for i in range(self.y_size):
                self.storage[i, self.index] = y_view[i]
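A side note on the snippet above: the element-by-element copy during the resize can probably be collapsed into a memoryview slice assignment, which Cython compiles into a bulk buffer copy. A sketch using the same variable names (not timed separately):

                # Sketch: copy the already-filled portion of the old storage in one
                # slice assignment instead of the explicit double loop
                new_storage_view[:, :self.index] = self.storage[:, :self.index]
                self.storage = new_storage_view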
I find a big improvement: roughly two orders of magnitude faster than appending to a list.
# Manually expanding numpy storage by fixed increments based on an initial guess
# Times if we have a perfect guess: 280us
%timeit ManuallyGrowNdarray(y0, 10_001)
# In reality we will never have a perfect guess for the size...
# Times if we overshoot by 2x: 641us
%timeit ManuallyGrowNdarray(y0, 2 * 10_001)
# Times if we undershoot by 2x: 687us
%timeit ManuallyGrowNdarray(y0, int(0.5 * 10_001))
Note: if I use np.resize instead of building a new np.empty array during the resize step, I find worse performance (823us if we undershoot by 2x).
Use C pointers and PyMem_Malloc to build the output array
The last example uses a raw C array with PyMem_Malloc (I also tried C's malloc/realloc but did not see a noticeable difference).
Relevant changes:
from cpython.mem cimport PyMem_Malloc, PyMem_Realloc, PyMem_Free

    ... # Other definitions
    cdef double* storage

    def __init__(self, const double[::1] y0, Py_ssize_t expected_size):
        # Set up a storage array based on a user-defined guess of the final size.
        self.concats = 1
        self.expected_size = expected_size
        self.new_size = self.expected_size
        self.storage = <double*>PyMem_Malloc(self.y_size * self.expected_size * sizeof(double))
        if not self.storage:
            raise MemoryError()
        ... # Other init steps

    cdef run(self):
        cdef double* new_storage
        while True:
            # Like before: run the loop as normal. At the end of the loop, check if there is enough room.
            if self.index >= (self.concats * self.expected_size):
                # We need to make a larger array
                self.concats += 1
                self.new_size = self.concats * self.expected_size
                new_storage = <double*>PyMem_Realloc(self.storage,
                                                     self.y_size * self.new_size * sizeof(double))
                if not new_storage:
                    # self.storage is left untouched so it can still be freed later
                    raise MemoryError()
                self.storage = new_storage
            # Add this step's results to storage
            for i in range(self.y_size):
                self.storage[self.y_size * self.index + i] = y_view[i]
    # Not shown: convert the final result to a numpy array for a friendlier output... Make sure to free the C buffer's memory...
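For completeness, here is roughly what that not-shown conversion and cleanup could look like. This is only a sketch: finalize is a hypothetical method name, solution is assumed to be declared as in the first example, and the pointer-to-memoryview cast assumes the index-major layout used in run() above.

    cdef finalize(self):
        # View the raw buffer as an (index, y_size) block, copy it into a
        # Python-owned numpy array, and transpose to the (y_size, index) shape
        cdef double[:, :] raw_view = <double[:self.index, :self.y_size]>self.storage
        self.solution = np.asarray(raw_view).T.copy()

    def __dealloc__(self):
        # Free the PyMem_Malloc'd buffer; PyMem_Free(NULL) is a safe no-op
        PyMem_Free(self.storage)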
There is a noticeable performance boost if the array needs to be expanded.
# Use C pointers with CPython's PyMem allocation functions
# Times if we have a perfect guess: 296us
%timeit GrowCArray(y0, 10_001)
# Times if we overshoot by 2x: 472us
%timeit GrowCArray(y0, 2 * 10_001)
# Times if we undershoot by 2x: 574us
%timeit GrowCArray(y0, int(0.5 * 10_001))
Summary:
What is the fastest way to grow 2D arrays in a Cython 3.0.0+ world? These simplified experiments found that preallocating memory (with either numpy or directly in C) is the fastest. But are there better approaches?
*: Resizing by a fixed increment, as is done here, is not the best way to go about it, since if you overshoot by even a little you have just grabbed double the memory you really need. I am also curious what good algorithms exist for choosing a more appropriate growth size - are there packages that can do this work for us? As noted, np.resize() has a big performance penalty, as does list.append().
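For reference on the growth-size question, the usual alternative to fixed increments is geometric growth: over-allocate by a constant factor so the total copying cost stays amortized O(n), which is roughly what CPython's list and C++'s std::vector do internally. A sketch of how the resize check in the PyMem example might look with a ~1.5x factor, where capacity is a hypothetical attribute replacing the expected_size/concats bookkeeping and new_storage is the temporary pointer from the realloc pattern above:

            if self.index >= self.capacity:
                # Grow the capacity by roughly 1.5x (and by at least 1 slot)
                self.capacity += (self.capacity >> 1) + 1
                new_storage = <double*>PyMem_Realloc(self.storage,
                                                     self.y_size * self.capacity * sizeof(double))
                if not new_storage:
                    raise MemoryError()
                self.storage = new_storage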
All examples were tested in a Jupyter notebook using the compiler directives: # cython: boundscheck=False, wraparound=False, nonecheck=False, cdivision=True, initializedcheck=False
Package versions: Python 3.11; Cython 3.0.0; NumPy 1.24.3