I have the following operation:
import pandas as pd
import numpy as np
def some_calc(x, y):
    x = x.set_index('Cat')
    y = y.set_index('Cat')
    y = np.sqrt(y['data_point2'])
    vec = pd.DataFrame(x['data_point1'] * y)
    grid = np.random.rand(len(x), len(x))
    result = vec.dot(vec.T).mul(grid).sum().sum()
    return result
sample_size = 100
cats = ['a','b','c','d']
df1 = pd.DataFrame({'Cat': [cats[np.random.randint(4)] for _ in range(sample_size)],
                    'data_point1': np.random.rand(sample_size),
                    'data_point2': np.random.rand(sample_size)})
df2 = df1.groupby('Cat').sum().reset_index()
I would like to run some_calc for each row of df2, using the corresponding data points from df1.
The code below works well:
df2['Apply'] = df2.apply(lambda x: some_calc(x=df1[df1['Cat']==x['Cat']][['Cat','data_point1']],
                                             y=df1[df1['Cat']==x['Cat']][['Cat','data_point2']]), axis=1)
(I reset the index in df2 because I don't know how to apply across indices. Also, I'm passing Cat as the index field together with each data_point column to some_calc, because without an index vec.dot(vec.T) collapses the dot product into a single number. That then fails in .mul(), because I need the full MxM matrix rather than a float. I might be doing something wrong here, though...)
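To illustrate the point about the dot product collapsing, here is a minimal sketch on a hypothetical three-element vector (v, grid and the other names are stand-ins, not my real data). It suggests that np.outer gives the full MxM matrix without relying on an index, and that the weighted sum equals the quadratic form v @ grid @ v:

```python
import numpy as np
import pandas as pd

# Hypothetical toy vector standing in for x['data_point1'] * sqrt(y['data_point2'])
v = pd.Series([1.0, 2.0, 3.0])

# For a Series, .T returns the Series itself, so this is an inner product (a scalar)
inner = v.dot(v.T)

# np.outer returns the full M x M matrix regardless of any index
arr = v.to_numpy()
outer = np.outer(arr, arr)

# Stand-in for the random grid; all ones keeps the arithmetic easy to check
grid = np.ones((len(arr), len(arr)))

# Sum over i, j of v_i * v_j * grid_ij, computed two equivalent ways
weighted_sum = (outer * grid).sum()
quadratic_form = arr @ grid @ arr
```

With a grid of ones, both weighted_sum and quadratic_form reduce to (1 + 2 + 3) squared.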
I'm currently exploring how I can vectorize the above so that the calculation does not slow down as sample_size grows.
I saw in previous threads that you can set raw=True so that the function receives each row as an np.array rather than a pd.Series.
df2['ApplyRaw'] = df2.apply(lambda x: some_calc(x=df1[df1['Cat']==x['Cat']][['Cat','data_point1']],
                                                y=df1[df1['Cat']==x['Cat']][['Cat','data_point2']]), axis=1, raw=True)
However, it throws an error:
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
I tried omitting Cat from the arguments, but I still get the same error.
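A minimal sketch of what raw=True appears to change, using a hypothetical two-row frame (df, record and the column total are illustrative names only). Each row arrives as a plain NumPy array, so a label lookup such as row['Cat'] raises this IndexError, while positional access still works:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame, only for illustration
df = pd.DataFrame({'Cat': ['a', 'b'], 'total': [1.0, 2.0]})

seen_types = set()

def record(row):
    # Remember which type each call received
    seen_types.add(type(row).__name__)
    return 0

df.apply(record, axis=1)             # rows are passed as pd.Series
df.apply(record, axis=1, raw=True)   # rows are passed as np.ndarray

# Label-based lookup on an ndarray row fails with the IndexError shown above
try:
    df.apply(lambda row: row['Cat'], axis=1, raw=True)
    message = None
except IndexError as exc:
    message = str(exc)

# Positional lookup works: column 0 holds 'Cat' in this toy frame
cats_by_position = df.apply(lambda row: row[0], axis=1, raw=True)
```

So with raw=True the lambda would have to refer to Cat by position, which depends on the column order of df2.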
Are there any code improvements or tricks I can employ that allow me to vectorize the above?
Or do I have to amend some_calc?
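One possible direction, sketched under the assumption that amending some_calc is acceptable: some_calc_np, rng and the GroupBy column are hypothetical names, not part of the code above. The idea is to group df1 once instead of re-filtering it for every row of df2, and to compute the weighted sum as a quadratic form on plain NumPy arrays. It still loops over the groups and still builds an MxM grid per category, so it is not a fully vectorized solution:

```python
import numpy as np
import pandas as pd

def some_calc_np(dp1, dp2, rng):
    """Hypothetical NumPy-only variant of some_calc (an assumption, not the original)."""
    v = dp1 * np.sqrt(dp2)                  # same vector as x['data_point1'] * sqrt(y['data_point2'])
    grid = rng.random((len(v), len(v)))     # random M x M grid, as in some_calc
    # Sum over i, j of v_i * v_j * grid_ij, written as a quadratic form
    return v @ grid @ v

rng = np.random.default_rng(0)
sample_size = 100
cats = ['a', 'b', 'c', 'd']
df1 = pd.DataFrame({'Cat': rng.choice(cats, size=sample_size),
                    'data_point1': rng.random(sample_size),
                    'data_point2': rng.random(sample_size)})

# One pass over the groups instead of one boolean filter of df1 per row of df2
per_cat = {cat: some_calc_np(g['data_point1'].to_numpy(),
                             g['data_point2'].to_numpy(), rng)
           for cat, g in df1.groupby('Cat')}

df2 = df1.groupby('Cat').sum().reset_index()
df2['GroupBy'] = df2['Cat'].map(per_cat)
```

Because the grid is random, the values will not match the Apply column number for number; only the structure of the calculation is the same.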