My problem is very similar to the one outlined here,
except that my main DataFrame has a category column, as does my DataFrame of weights:
df
Out[3]:
Symbol     var_1     var_2     var_3     var_4 Category
Index
1903    0.000443  0.006928  0.000000  0.012375        A
1904   -0.000690 -0.007873  0.000171  0.014824        A
1905   -0.001354  0.001545  0.000007 -0.008195        C
1906   -0.001578  0.008796 -0.000164  0.015955        D
1907   -0.001578  0.008796 -0.000164  0.015955        A
1909   -0.001354  0.001545  0.000007 -0.008195        B
wgt_df
Out[4]:
  Category  var_1_wgt  var_2_wgt  var_3_wgt  var_4_wgt
0        A   0.182022   0.182022   0.131243   0.182022
1        B   0.534814   0.534814   0.534814   0.534814
2        C   0.131243   0.534814   0.131243   0.182022
3        D   0.182022   0.151921   0.151921   0.131243
What I am trying to do is apply the correct weights for each category to create a new column, df['new_var'], which is the weighted sum of the four var columns. In the case without a category, I can convert the weights into a NumPy array and use the .dot() method, which seems very fast. However, I cannot see how to do this for my problem: if I use groupby() on the main DataFrame, df, I must surely do something similar to my DataFrame of weights, wgt_df.
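For reference, this is the no-category version I mean (toy data; the uniform weights are just placeholders):

```python
import numpy as np
import pandas as pd

var_cols = ["var_1", "var_2", "var_3", "var_4"]

# Small frame mirroring df's numeric columns; values are illustrative.
df = pd.DataFrame(
    {"var_1": [0.000443, -0.000690],
     "var_2": [0.006928, -0.007873],
     "var_3": [0.000000, 0.000171],
     "var_4": [0.012375, 0.014824]},
    index=[1903, 1904],
)

# One weight per var column (placeholder values).
weights = np.array([0.25, 0.25, 0.25, 0.25])

# A single matrix-vector product over all rows -- no Python-level loop.
df["new_var"] = df[var_cols].values.dot(weights)
```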
In reality, df contains several million rows, and I need to repeat this calculation many times, so I am keen to find a vectorized solution. Failing that, I could do df.groupby('Category'), create a dict of DataFrames keyed by category, e.g. wgts_dict['A'] = wgt_df[wgt_df.Category == 'A'], and apply my dot logic via a lambda. I am not sure how to do even that, though, as I'd need to refer explicitly to the group currently being processed in order to pull out the correct slice of wgt_df.
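To make that fallback concrete, here is a sketch of the groupby route I have in mind. The names wgts and weighted_sum are my own; the group's .name attribute inside the apply is one way to tell which category is currently being processed:

```python
import numpy as np
import pandas as pd

var_cols = ["var_1", "var_2", "var_3", "var_4"]
wgt_cols = ["var_1_wgt", "var_2_wgt", "var_3_wgt", "var_4_wgt"]

# Toy versions of the two frames (two rows / two categories for brevity).
df = pd.DataFrame(
    {"var_1": [0.000443, -0.001354],
     "var_2": [0.006928, 0.001545],
     "var_3": [0.000000, 0.000007],
     "var_4": [0.012375, -0.008195],
     "Category": ["A", "C"]},
    index=[1903, 1905],
)
wgt_df = pd.DataFrame(
    {"Category": ["A", "C"],
     "var_1_wgt": [0.182022, 0.131243],
     "var_2_wgt": [0.182022, 0.534814],
     "var_3_wgt": [0.131243, 0.131243],
     "var_4_wgt": [0.182022, 0.182022]},
)

# Dict of weight vectors keyed by category, built once from wgt_df.
wgts = {cat: row.values
        for cat, row in wgt_df.set_index("Category")[wgt_cols].iterrows()}

def weighted_sum(g):
    # g.name is the category of the group being processed, so it selects
    # the matching weight vector; one dot product per group.
    return pd.Series(g[var_cols].values.dot(wgts[g.name]), index=g.index)

df["new_var"] = df.groupby("Category", group_keys=False).apply(weighted_sum)
```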