
I am very new to Cython, yet am already experiencing extraordinary speedups just by copying my .py to .pyx (and adding cimport cython, numpy etc.) and importing into ipython3 with pyximport. Many tutorials start with this approach, with the next step being to add cdef declarations for every data type, which I can do for the iterators in my for loops etc. But unlike most pandas Cython tutorials or examples, I am not applying functions so to speak, more manipulating data using slices, sums and division (etc.).

So the question is: Can I increase the speed at which my code runs by stating that my DataFrame only contains floats (doubles), with column and row labels that are ints?

How do I define the type of an embedded list? e.g. [[int, int], [int]]
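
To illustrate what I mean, here is a sketch of the two kinds of declarations I'm asking about (in a .pyx file; `use_typed_data` is just a made-up name):

    # sketch of the two declarations in question (in a .pyx file)
    def use_typed_data(df, part):
        # a nested list like [[int, int], [int]] can only be declared as a
        # generic Python list; Cython cannot statically type the inner ints
        cdef list blocks = part

        # a DataFrame of floats can expose its underlying float64 array as
        # a typed 2-D memoryview, which gives C-speed element access
        cdef double[:, :] values = df.values
        cdef Py_ssize_t i, j
        cdef double total = 0.0
        for i in range(values.shape[0]):
            for j in range(values.shape[1]):
                total += values[i, j]
        return total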

Here is an example that generates the AIC score for a partitioning of a DF, sorry it is so verbose:

    cimport cython
    import numpy as np
    cimport numpy as np
    import pandas as pd

    offcat = [
        "breakingPeace", 
        "damage", 
        "deception", 
        "kill", 
        "miscellaneous", 
        "royalOffences", 
        "sexual", 
        "theft", 
        "violentTheft"
        ]

    def partitionAIC(EmpFrame, part, OffenceEstimateFrame, ReturnDeathEstimate=False):
        """EmpFrame is DataFrame of ints, part is nested list of ints, OffenceEstimate frame is DF of float"""
        """partOf/block is a list of ints"""
        """ll, AIC,  is series/frame of floats"""
        ##Cython cdefs
        cdef int DFlen
        cdef int puns
        cdef int DeathPun    
        cdef int k
        cdef int pId
        cdef int punish

        DFlen = EmpFrame.shape[1]
        puns = 2
        DeathPun = 0
        PartitionModel = pd.DataFrame(index = EmpFrame.index, columns = EmpFrame.columns)

        for partOf in part:
            Grouping = [puns*x + y for x in partOf for y in list(range(0,puns))]
            PartGroupSum = EmpFrame.iloc[:,Grouping].sum(axis=1)

            for punish in range(0,puns):
                PunishGroup = [x*puns+punish for x in partOf]
                punishPunishment = ((EmpFrame.iloc[:,PunishGroup].sum(axis = 1) + 1/puns).div(PartGroupSum+1)).values[np.newaxis].T
                PartitionModel.iloc[:,PunishGroup] = punishPunishment
        PartitionModel = PartitionModel*OffenceEstimateFrame

        if ReturnDeathEstimate:
            DeathProbFrame = pd.DataFrame([[part]], index=EmpFrame.index, columns=['Partition'])
            for pId,block in enumerate(part):
                DeathProbFrame[pId] = PartitionModel.iloc[:,block[::puns]].sum(axis=1)
            DeathProbFrame = DeathProbFrame.apply(lambda row: sorted( [ [format("%6.5f"%row[idx])]+[offcat[X] for X in  x ] 
                for idx,x in enumerate(row['Partition'])],
                key=lambda x: x[0], reverse=True),axis=1)
        # astype(float) replaces the deprecated convert_objects(convert_numeric=True)
        ll = (EmpFrame*np.log(PartitionModel.astype(float))).sum(axis=1)
        k = (len(part))*(puns-1)
        AIC = 2*k-2*ll

        if ReturnDeathEstimate:
            return AIC, DeathProbFrame
        else:
            return AIC
SpmP

1 Answer


My advice is to do as much as possible in pandas. This is kinda standard advice: "get it working first, then care about performance if it really matters". So let's suppose you've done that (hopefully you've written some tests too), and it's too slow:

Profile your code. (See this SO answer, or use %prun in IPython.)
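
For example (assuming your function and its arguments are already defined in your session), a one-liner in IPython:

    # profile a single call, sorting by cumulative time
    %prun -s cumulative partitionAIC(EmpFrame, part, OffenceEstimateFrame)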

The output of prun should drive which bit to improve next (a sketch of the first two follows the list):

  1. pandas (make your code more pandorable, this can help a lot).
  2. numpy (not creating intermediary Series/DataFrames, being careful about dtypes)
  3. cython (the last resort).
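
A toy sketch of steps 1 and 2, on made-up data rather than the frames from the question: the same row-wise quantity computed pandorably, then again on the raw numpy array to skip intermediate objects:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.rand(1000, 4), columns=list('abcd'))

    # 1. pandorable: one vectorized expression instead of a Python row loop
    ratio = df.sum(axis=1) / (df['a'] + 1)

    # 2. numpy: work on the underlying float64 array directly, avoiding
    #    the construction of intermediate Series
    arr = df.values
    ratio_np = arr.sum(axis=1) / (arr[:, 0] + 1)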

Now, if it is a line to do with slicing (it probably isn't), put that tiny part in cython; I like to replace a single python function call with a cython function. On that point, code in cython should use numpy, not pandas; pandas is not going to lower to C (cython can't infer its types).


Putting your entire code into cython won't actually help that much; you want to put in only the specific lines, or function calls, which are performance sensitive. Keeping cython focused is the only way to have a good time.
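
For instance, a minimal sketch of that kind of refactor (`row_sums` is a made-up stand-in for whatever prun flags as hot; note it takes a numpy array, not a DataFrame):

    # small .pyx module: only the hot loop is cythonized and typed
    cimport cython
    import numpy as np

    @cython.boundscheck(False)
    @cython.wraparound(False)
    def row_sums(double[:, :] values):
        cdef Py_ssize_t i, j
        cdef double[:] out = np.zeros(values.shape[0])
        for i in range(values.shape[0]):
            for j in range(values.shape[1]):
                out[i] += values[i, j]
        return np.asarray(out)

From Python you'd call it as e.g. `row_sums(EmpFrame.values.astype(float))`, since the memoryview argument needs a float64 array.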

Read the enhancing performance section of the pandas docs*! There this process (prun -> cythonize -> type) is gone over step-by-step with a real-life example.

*Full disclosure: I wrote that section of the docs! :)
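
To make that prun -> cythonize -> type loop concrete, a toy sketch in an IPython session (not the docs' exact example):

    %load_ext Cython

    %%cython
    # steps: 1. %prun the pure-Python version to find the hot function;
    # 2. paste it here unchanged for a small compilation win;
    # 3. add types to the hot loop (shown) for the big win
    def summed(double[:] x):
        cdef Py_ssize_t i
        cdef double s = 0.0
        for i in range(x.shape[0]):
            s += x[i]
        return s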

Andy Hayden
  • Putting my entire code into Cython helped unbelievably! From an overnight run to 20 minutes!! Watching the CPU states, when running in Python the CPU spent a lot of time in C1+. So motivated by this, the question is more of a 'how to get blanket speedups' rather than optimisation. Thanks for the docs and other work you are doing; it was the basis of getting as far as I did 8). Does pandas handle all the cell types and pass this on to Cython? – SpmP Apr 27 '15 at 06:02
  • Well, it may do, but I think you can get a more effective speedup by only cythonising a small part of the code (the performance-sensitive bit). Pandas does handle different dtypes when you use pandas methods (these are already vectorized or written in cython themselves). – Andy Hayden Apr 27 '15 at 06:12
  • Which is to say, in response to "'how to get blanket speedups' rather than optimisation", you should be able to get *more* speedup from optimisation than from blanket cythonizing. – Andy Hayden Apr 27 '15 at 06:16