
I have the following operation:

import pandas as pd
import numpy as np

def some_calc(x,y):
    x = x.set_index('Cat')
    y = y.set_index('Cat')
    y = np.sqrt(y['data_point2'])
    vec = pd.DataFrame(x['data_point1'] * y)
    grid = np.random.rand(len(x),len(x))
    result = vec.dot(vec.T).mul(grid).sum().sum()
    return result

sample_size = 100
cats = ['a','b','c','d']

df1 = pd.DataFrame({'Cat':[cats[np.random.randint(4)] for _ in range(sample_size)],
                    'data_point1':np.random.rand(sample_size),
                    'data_point2':np.random.rand(sample_size)})

df2 = df1.groupby('Cat').sum().reset_index()

I would like to run some_calc on each row of df2 using the corresponding data points from df1.

The code below works well:

df2['Apply'] = df2.apply(lambda x: some_calc(x=df1[df1['Cat']==x['Cat']][['Cat','data_point1']], 
                                             y=df1[df1['Cat']==x['Cat']][['Cat','data_point2']]),axis=1)

(I reset the index in df2 because I don't know how to apply across indices. I also pass Cat along with each data_point column so that some_calc can set it as the index: without an index, vec.dot(vec.T) collapses the dot product into a single float, which then errors in .mul() because I need the full MxM matrix rather than a scalar. I might be doing something wrong here, though.)
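To illustrate what I mean about the dot product collapsing (a minimal sketch with made-up values, not my real data): a plain Series dotted with itself gives one float, while an (M, 1) column vector times its transpose gives the full MxM outer product.

```python
import numpy as np
import pandas as pd

v = pd.Series([1.0, 2.0, 3.0])

# Series.dot(Series) contracts to a single float (the inner product)
inner = v.dot(v)  # 1*1 + 2*2 + 3*3 = 14.0

# an (M, 1) column vector times its transpose keeps the full MxM outer product
col = v.to_numpy().reshape(-1, 1)
outer = col @ col.T  # shape (3, 3)
```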

I'm currently exploring how I can vectorize the above so that when sample_size grows I will not be hampered by a slowdown in the calculation.

I saw in previous threads that you can set raw=True so that apply passes np.array objects to the function instead of pd.Series.

df2['ApplyRaw'] = df2.apply(lambda x: some_calc(x=df1[df1['Cat']==x['Cat']][['Cat','data_point1']], 
                                                y=df1[df1['Cat']==x['Cat']][['Cat','data_point2']]),axis=1, raw=True)

However, it throws an error:

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

I tried omitting Cat from the arguments, but I still get the same issue.
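This error seems consistent with how raw=True behaves: apply hands each row to the lambda as a plain ndarray, so the label lookup x['Cat'] inside the lambda fails before anything else runs. A minimal reproduction (with a hypothetical row, just to show the failure mode):

```python
import numpy as np

# with raw=True, each row arrives as a plain ndarray, not a Series,
# so label-based indexing like row['Cat'] is no longer available
row = np.array(['a', 0.5, 0.7], dtype=object)
try:
    row['Cat']
except IndexError as e:
    print(e)  # only integers, slices (`:`), ellipsis ... are valid indices
```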

Are there any code improvements or tricks I can employ that allow me to vectorize the above? Or do I have to amend some_calc?

RealRageDontQuit

1 Answer


I'm not sure if it's possible to vectorize your function since it's a bit complex. However, some_calc itself and how it is called can be optimized.

What

df2['Apply'] = df2.apply(lambda x: some_calc(x=df1[df1['Cat']==x['Cat']][['Cat','data_point1']], 
                                             y=df1[df1['Cat']==x['Cat']][['Cat','data_point2']]),axis=1)

does is basically the same as a groupby. So instead of creating these groups and applying the function on them, use groupby + apply. Simplifying the some_calc function as well, we get:

def some_calc(df):
    # work on plain NumPy arrays to avoid index-alignment overhead
    x = df['data_point1'].values
    y = np.sqrt(df['data_point2'].values)
    vec = (x * y).reshape(-1, 1)           # column vector, shape (M, 1)
    grid = np.random.rand(len(x), len(x))
    # outer product (M x M), scaled elementwise by grid, summed to a scalar
    result = (vec @ vec.T * grid).sum()
    return result

apply = df1.groupby('Cat').apply(some_calc)
apply.name = 'Apply'
df2.merge(apply, left_on='Cat', right_index=True)

The final merge is just to add the results to the df2 dataframe.
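For completeness, here is the whole pipeline run end to end on the sample data from the question (seeding NumPy only so the data generation is reproducible; the grid inside some_calc is still drawn fresh on each call):

```python
import numpy as np
import pandas as pd

def some_calc(df):
    x = df['data_point1'].values
    y = np.sqrt(df['data_point2'].values)
    vec = (x * y).reshape(-1, 1)
    grid = np.random.rand(len(x), len(x))
    return (vec @ vec.T * grid).sum()

np.random.seed(0)
sample_size = 100
cats = ['a', 'b', 'c', 'd']
df1 = pd.DataFrame({'Cat': np.random.choice(cats, sample_size),
                    'data_point1': np.random.rand(sample_size),
                    'data_point2': np.random.rand(sample_size)})
df2 = df1.groupby('Cat').sum().reset_index()

apply = df1.groupby('Cat').apply(some_calc)
apply.name = 'Apply'
out = df2.merge(apply, left_on='Cat', right_index=True)
print(out[['Cat', 'Apply']])  # one Apply value per category
```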

Timings:

# original
20.5 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# above code
3.62 ms ± 668 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Shaido
  • that's a very good suggestion. However, I think I over simplified the problem. The array `grid` in `some_calc` is also an input in some edge cases. I will amend the question to reflect the above. Apologies – RealRageDontQuit Nov 23 '21 at 10:57
  • @RealRageDontQuit: I think you missed adding the edit to the question? If it's regarding extra arguments to the apply, you can see the following: https://stackoverflow.com/questions/43483365/use-pandas-groupby-apply-with-arguments – Shaido Nov 24 '21 at 02:00
  • A detailed blog post about Pandas performance: https://tomaugspurger.github.io/modern-4-performance – jsmart Nov 25 '21 at 21:20