I have a big DataFrame in pandas with three columns: 'col1' holds strings, and 'col2' and 'col3' are numpy.int64. I need to do a groupby, then apply a custom aggregation function using apply, as follows:
import pandas as pd

df = pd.read_csv(...)
groups = df.groupby('col1').apply(my_custom_function)
Each group can be seen as a numpy array with two integer columns, 'col2' and 'col3'. To understand what I am doing, you can think of each row ('col2', 'col3') as a time interval; I am checking whether any of the intervals intersect. I first sort the array by the first column, then test whether the second column value at index i is smaller than the first column value at index i + 1.
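For concreteness, here is a minimal example with made-up data (the values are hypothetical) showing what one group looks like and the answer I expect, using the my_custom_function defined below:

    import pandas as pd

    # Group 'a' has overlapping intervals (1, 5) and (3, 8);
    # group 'b' has disjoint intervals (1, 2) and (3, 4).
    df = pd.DataFrame({
        'col1': ['a', 'a', 'b', 'b'],
        'col2': [1, 3, 1, 3],
        'col3': [5, 8, 2, 4],
    })

    print(df.groupby('col1').apply(my_custom_function))
    # col1
    # a     True
    # b    False
    # dtype: bool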
FIRST QUESTION: My idea is to use Cython to define the custom aggregate function. Is this a good idea?
I tried the following definition in a .pyx file:
cimport numpy as c_np

def c_my_custom_function(my_group_df):
    cdef Py_ssize_t l = len(my_group_df.index)
    if l < 2:
        return False
    # Sort the intervals by their start, then scan for an overlap:
    # interval i intersects interval i + 1 if it ends after the next one starts.
    cdef c_np.int64_t[:, :] temp_array
    temp_array = my_group_df[['col2', 'col3']].sort_values('col2').values
    cdef Py_ssize_t i
    for i in range(l - 1):
        if temp_array[i, 1] > temp_array[i + 1, 0]:
            return True
    return False
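For reference, this is roughly how I build and load the extension, assuming the file is saved as c_my_custom_function.pyx (the filename is my choice; a setup.py build would work equally well):

    import numpy
    import pyximport

    # The .pyx cimports numpy, so pyximport needs the numpy headers.
    pyximport.install(setup_args={'include_dirs': [numpy.get_include()]})

    from c_my_custom_function import c_my_custom_function

    result = df.groupby('col1').apply(c_my_custom_function)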
I also defined a version in pure Python/pandas:
def my_custom_function(my_group_df):
    l = len(my_group_df.index)
    if l < 2:
        return False
    # Same algorithm: sort intervals by their start, then scan for an overlap.
    temp_array = my_group_df[['col2', 'col3']].sort_values('col2').values
    for i in range(l - 1):
        if temp_array[i, 1] > temp_array[i + 1, 0]:
            return True
    return False
SECOND QUESTION: I timed the two versions, and both take exactly the same time. The Cython version does not seem to speed anything up at all. What is happening?
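The timing was done along these lines (a minimal sketch; the repetition count is arbitrary and df is the DataFrame loaded above):

    import timeit

    # Time each version over the full groupby-apply.
    py_time = timeit.timeit(
        lambda: df.groupby('col1').apply(my_custom_function), number=10)
    cy_time = timeit.timeit(
        lambda: df.groupby('col1').apply(c_my_custom_function), number=10)
    print(py_time, cy_time)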
BONUS QUESTION: Do you see a better way to implement this algorithm?