I have a big DataFrame in pandas with three columns: 'col1' is string, 'col2' and 'col3' are numpy.int64. I need to do a groupby, then apply a custom aggregation function using apply, as follows:

pd = pandas.read_csv(...)
groups = pd.groupby('col1').apply(my_custom_function)

Each group can be seen as a numpy array with two integer columns, 'col2' and 'col3'. To understand what I am doing, you can think of each row ('col2','col3') as a time interval; I am checking whether any two intervals intersect. I first sort the array by the first column, then test whether the second column value at index i is smaller than the first column value at index i + 1.

FIRST QUESTION: My idea is to use Cython to define the custom aggregate function. Is this a good idea?

I tried the following definition in a .pyx file:

cimport numpy as c_np

def c_my_custom_function(my_group_df):
    cdef Py_ssize_t l = len(my_group_df.index)
    if l < 2:
        return False

    cdef c_np.int64_t[:, :] temp_array
    temp_array = my_group_df[['col2','col3']].sort_values(by='col2').values
    cdef Py_ssize_t i

    for i in range(l - 1):
        if temp_array[i, 1] > temp_array[i + 1, 0]:
            return True
    return False

I also defined a version in pure Python/pandas:

def my_custom_function(my_group_df):
    l = len(my_group_df.index)
    if l < 2:
        return False

    temp_array = my_group_df[['col2', 'col3']].sort_values(by='col2').values

    for i in range(l - 1):
        if temp_array[i, 1] > temp_array[i + 1, 0]:
            return True
    return False

SECOND QUESTION: I timed the two versions, and both take exactly the same time. The Cython version does not seem to speed up anything. What is happening?

BONUS QUESTION: Do you see a better way to implement this algorithm?

ali_m
sweeeeeet
  • You should profile your code to see which part takes the most time. For all I know, the bottleneck could be in the `group` or `sort` calls, and then Cython wouldn't help. BTW, the Cython version looks right; I don't think there is a way to optimize it further (well, maybe using `cpdef` in the function definition). – rth Apr 26 '15 at 09:40
  • I second @rth's comment: there's a very good chance that you're spending most of your time in the `sort` operation. `cpdef` is only faster than `def` when you are calling the function from C/Cython rather than Python. – ali_m Apr 26 '15 at 19:18
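As a rough illustration of the profiling suggested in the comments above, here is a minimal sketch using the standard-library `cProfile` module. The synthetic DataFrame, its size, and the name `has_overlap` are assumptions made for the example; they are not the asker's data or code.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

def has_overlap(group):
    # Same idea as the question's function: sort by start, compare neighbours.
    arr = group[["col2", "col3"]].sort_values(by="col2").values
    return bool(np.any(arr[:-1, 1] > arr[1:, 0]))

# Hypothetical data: 2,000 groups of 5 disjoint intervals each.
n_groups, per_group = 2_000, 5
starts = np.tile(np.arange(per_group) * 10, n_groups)
df = pd.DataFrame({
    "col1": np.repeat([f"g{i}" for i in range(n_groups)], per_group),
    "col2": starts,
    "col3": starts + 5,
})

profiler = cProfile.Profile()
profiler.enable()
result = df.groupby("col1")[["col2", "col3"]].apply(has_overlap)
profiler.disable()

# Print the ten most expensive calls by cumulative time.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(10)
print(buffer.getvalue())
```

If the listing is dominated by pandas internals (group iteration, `sort_values`) and not by the comparison itself, compiling the comparison loop with Cython is unlikely to change the total time much.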

1 Answer

A vectorized NumPy test could be:

np.any(temp_array[:-1, 1] > temp_array[1:, 0])

Whether it does better than the Python or Cython iteration depends on where the True occurs, if at all. If the loop returns at an early step, the iteration is clearly better, because it stops there while the vectorized test always scans the whole array. In that case the Cython version won't have much of an advantage over the Python one either. Also, the test step will be faster than the sort step.

But if the iteration usually steps all the way through, then the vectorized test will be faster than the Python iteration, and faster than the sort. It may, though, be slower than a properly coded Cython iteration.
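For reference, here is a minimal self-contained sketch of that vectorized test wrapped in a function that still performs the sort first. The function name `has_overlap` and the sample intervals are assumptions for illustration only.

```python
import numpy as np

def has_overlap(intervals):
    """Return True if any two [start, end] rows of `intervals` overlap."""
    arr = np.asarray(intervals, dtype=np.int64)
    if arr.shape[0] < 2:
        return False
    # Sort rows by start value (column 0).
    arr = arr[np.argsort(arr[:, 0], kind="stable")]
    # Compare every end with the start of the following row in one shot.
    return bool(np.any(arr[:-1, 1] > arr[1:, 0]))

print(has_overlap([[1, 3], [5, 8], [2, 4]]))  # [1, 3] and [2, 4] overlap
print(has_overlap([[1, 2], [3, 4], [5, 6]]))  # disjoint intervals
```

After sorting by start, checking only neighbouring rows is enough: if any two intervals overlap, the earlier one also overlaps the row immediately after it.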

hpaulj
  • I am not sure I understand your comment about "the sort step", because it is still necessary even with your version. The timing is exactly equal to that of the two other versions, so it is definitely the sort step which is the bottleneck. The thing is, maybe the group by operations are order-preserving and I can sort directly when I extract the data from Redshift, for example. – sweeeeeet Apr 27 '15 at 05:03
  • By the way, even when I skip the "sort step" in all three versions (because the group by seems to preserve the ordering), the timings are very close. The pure Python loop wins by 600 ms, then comes your version, and then the Cython version 400 ms later. Respective timings: 16.5 s, 17.1 s, 17.5 s. – sweeeeeet Apr 27 '15 at 05:25
  • Does the function return True ? For what value of `i`? – hpaulj Apr 27 '15 at 10:56
  • I think you didn't really understand what my algorithm is about. As I said, the sort step is inevitable. Your piece of code comes right after it, and lets me avoid looping over temp_array as I previously did. But it is still necessary to sort this array in order to get the correct result (remember that I am trying to see whether, in a set of intervals, two of them intersect). – sweeeeeet Apr 28 '15 at 12:55
  • I don't mean to replace the sort. It's just that if the testing loop returns True with a small `i`, you won't see much difference between the Cython loop and the Python one. But if the test returns False, you have to loop through the whole array. – hpaulj Apr 28 '15 at 15:51
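Since the thread concludes that the per-group sort dominates the runtime, one possible alternative (not proposed by anyone above, so treat it as an untested-on-your-data sketch) is to sort the whole frame once and compare each interval with the next one in the same group, without calling `apply` at all. The function name `groups_with_overlap` and the sample frame are assumptions for illustration.

```python
import pandas as pd

def groups_with_overlap(df):
    """Return a boolean Series indexed by col1: True if the group has overlapping intervals."""
    # Sort once for the whole frame instead of once per group.
    s = df.sort_values(["col1", "col2"])
    # Start of the next interval within the same group (NaN on each group's last row).
    next_start = s.groupby("col1")["col2"].shift(-1)
    # Comparisons against NaN are False, so a group's last row never flags.
    overlap = s["col3"] > next_start
    return overlap.groupby(s["col1"]).any()

df = pd.DataFrame({
    "col1": ["a", "a", "b", "b", "c"],
    "col2": [1, 2, 1, 5, 0],
    "col3": [3, 4, 2, 6, 9],
})
print(groups_with_overlap(df))
```

Whether this is actually faster depends on the number and size of the groups, so it is worth timing against the `apply` version on the real data.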