
My problem is very similar to the one outlined here

Except that my main data frame has a Category column, as does my data frame of weights:

df
Out[3]: 
Symbol         var_1      var_2     var_3     var_4    Category  
Index                                              
1903          0.000443  0.006928  0.000000  0.012375      A
1904         -0.000690 -0.007873  0.000171  0.014824      A
1905         -0.001354  0.001545  0.000007 -0.008195      C
1906         -0.001578  0.008796 -0.000164  0.015955      D
1907         -0.001578  0.008796 -0.000164  0.015955      A
1909         -0.001354  0.001545  0.000007 -0.008195      B


wgt_df
Out[4]: 
  Category  var_1_wgt  var_2_wgt  var_3_wgt  var_4_wgt
0        A   0.182022   0.182022   0.131243   0.182022
1        B   0.534814   0.534814   0.534814   0.534814
2        C   0.131243   0.534814   0.131243   0.182022
3        D   0.182022   0.151921   0.151921   0.131243

What I am trying to do is apply the correct weights for each category to create a new column, df['new_var'], which is the weighted sum of the four variables. In the case without a category, I can convert the weights into a numpy array and use the .dot() method, which seems very fast. However, I cannot see how to do this for my problem: if I use groupby() on the main dataframe, df, I presumably need to do something equivalent with my dataframe of weights, wgt_df.
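
For reference, this is the kind of category-free calculation I mean (a toy example with made-up column names and numbers, not my real data):

import numpy as np
import pandas as pd

# Toy example without categories: one global weight vector.
df_nc = pd.DataFrame({'var_1': [0.1, -0.2], 'var_2': [0.3, 0.4]})
weights = np.array([0.6, 0.4])  # weights for var_1, var_2

# Weighted sum per row via a single matrix-vector product.
df_nc['new_var'] = df_nc[['var_1', 'var_2']].values.dot(weights)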

In reality, df contains several million rows, and I need to repeat this calculation many times, so I am keen to find a vectorized solution. Failing that, I could do df.groupby('Category'), create a dict of weight slices keyed by category, e.g. wgt_dict['A'] = wgt_df[wgt_df.Category == 'A'], and apply my dot logic via a lambda, although I am also not sure how to do that, as I'd need to explicitly refer to which group element was currently being processed in order to pull out the correct slice of wgt_df (a rough sketch of the idea is below).
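
Something like this is what I have in mind, using the small frames above, though I do not know whether it is the right (or a fast) approach (wgt_dict, var_cols and parts are just illustrative names):

import numpy as np
import pandas as pd

var_cols = ['var_1', 'var_2', 'var_3', 'var_4']

# One weight vector per category, keyed by the category label.
wgt_dict = {row['Category']: row[[c + '_wgt' for c in var_cols]].to_numpy(dtype=float)
            for _, row in wgt_df.iterrows()}

# Dot each category's block of rows against its own weights, then stitch back together.
parts = []
for category, group in df.groupby('Category'):
    parts.append(pd.Series(group[var_cols].to_numpy().dot(wgt_dict[category]),
                           index=group.index))
df['new_var'] = pd.concat(parts).reindex(df.index)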

Carl
  • You could join the data frames `joined = df.join(df_wgt.set_index('Category'), on='Category')` and then do the multiplication column-wise `joined['var_1'] * joined['var_1_wgt']`. I'm not sure how fast that is though. – emulbreh May 12 '17 at 20:49
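
A minimal sketch of the join approach from the comment above, using the wgt_df naming from the question (the final sum across the weighted columns is an assumption about the intended result):

# Attach each row's category weights, then combine column-wise.
joined = df.join(wgt_df.set_index('Category'), on='Category')

var_cols = ['var_1', 'var_2', 'var_3', 'var_4']
wgt_cols = [c + '_wgt' for c in var_cols]

# Element-wise multiply, then sum across the four weighted products.
df['new_var'] = (joined[var_cols].to_numpy() * joined[wgt_cols].to_numpy()).sum(axis=1)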

2 Answers


Setup

print(df)
Out[655]: 
           var_1     var_2     var_3     var_4 Category
Symbol                                                 
1903    0.000443  0.006928  0.000000  0.012375        A
1904   -0.000690 -0.007873  0.000171  0.014824        A
1905   -0.001354  0.001545  0.000007 -0.008195        C
1906   -0.001578  0.008796 -0.000164  0.015955        D
1907   -0.001578  0.008796 -0.000164  0.015955        A
1909   -0.001354  0.001545  0.000007 -0.008195        B

print(w)
Out[656]: 
  Category  var_1_wgt  var_2_wgt  var_3_wgt  var_4_wgt
0        A   0.182022   0.182022   0.131243   0.182022
1        B   0.534814   0.534814   0.534814   0.534814
2        C   0.131243   0.534814   0.131243   0.182022
3        D   0.182022   0.151921   0.151921   0.131243

Solution

import numpy as np

#convert Category to a numerical code (assumes single-letter categories in the same row order as w)
df['C_Number'] = df.Category.apply(lambda x: ord(x.lower()) - 97)

#dot each row with every category's weights (note the transpose), then extract the
#result that matches each row's category number
df['new_var'] = df.iloc[:, :4].values.dot(w.iloc[:, -4:].values.T)[np.arange(len(df)), df.C_Number]

Out[654]: 
           var_1     var_2     var_3     var_4 Category  C_Number   new_var
Symbol                                                                     
1903    0.000443  0.006928  0.000000  0.012375        A         0  0.003594
1904   -0.000690 -0.007873  0.000171  0.014824        A         0  0.001162
1905   -0.001354  0.001545  0.000007 -0.008195        C         2 -0.000842
1906   -0.001578  0.008796 -0.000164  0.015955        D         3  0.003118
1907   -0.001578  0.008796 -0.000164  0.015955        A         0  0.004196
1909   -0.001354  0.001545  0.000007 -0.008195        B         1 -0.004277
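
If the categories are not single letters, a variant of the same idea (not part of the original answer) maps them to row positions of w explicitly instead of relying on ASCII codes:

import numpy as np
import pandas as pd

# Map each row's Category to its row position in w.
cat_pos = pd.Index(w['Category']).get_indexer(df['Category'])

# Same dot-then-select idea as above.
all_sums = df.iloc[:, :4].values.dot(w.iloc[:, -4:].values.T)
df['new_var'] = all_sums[np.arange(len(df)), cat_pos]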
Allen Qin

You can do a groupby (select by category) and then do the dot(), or you can do the dot() and then select by category. The latter is faster and simpler in pandas. Note that in the data I used, the column names match between the data frame and the weights frame.

Code for dot() and then select:

df['dot'] = df[df_wgt.columns].dot(df_wgt.T).lookup(df.index, df.Category)

Steps performed...

  1. Select the columns to use with df[df_wgt.columns]

    This uses the column labels and ordering from the weight dataframe. This is important because dot() needs the data to be in the correct order.

  2. Perform the dot product against the transposed weights dataframe with .dot(df_wgt.T)

    Transposing the weights puts them in the correct orientation for the .dot(). This does the calculation for all of the weight categories for each row of data. That means in this case we do four times as many multiplications as are needed, but it is still likely faster than doing the grouping.

  3. Select the needed dot product with .lookup(df.index, df.Category)

    By using lookup() we can gather the correct result for the category of that row.
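
As a side note, DataFrame.lookup has since been deprecated in newer pandas releases; a minimal equivalent of the select step using numpy indexing (assuming the same df and df_wgt as in the test code below) would be:

import numpy as np

# Dot product against all categories, as in step 2.
all_cats = df[df_wgt.columns].dot(df_wgt.T)

# Equivalent of .lookup(): for each row, pick the column matching that row's Category.
rows = np.arange(len(df))
cols = all_cats.columns.get_indexer(df['Category'])
df['dot'] = all_cats.to_numpy()[rows, cols]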

Code for select (groupby) and then dot():

def dot(group):
    # Every row in a group shares one category; grab its weight vector once.
    category = group['Category'].iloc[0]
    weights = df_wgt.loc[category].values
    # Weighted sum for each row of the group, keyed by the original index.
    return pd.Series(
        np.dot(group[df_wgt.columns].values, weights), index=group.index)

# Apply per category, then restore the original 'Index' as the index.
df['dot'] = df.groupby(['Category']).apply(dot) \
    .reset_index().set_index('Index')[0]

Test Code:

import numpy as np  # needed for the groupby version above
import pandas as pd
from io import StringIO

df = pd.read_fwf(StringIO(u"""
    Index          var_1      var_2     var_3     var_4    Category
    1903          0.000443  0.006928  0.000000  0.012375      A
    1904         -0.000690 -0.007873  0.000171  0.014824      A
    1905         -0.001354  0.001545  0.000007 -0.008195      C
    1906         -0.001578  0.008796 -0.000164  0.015955      D
    1907         -0.001578  0.008796 -0.000164  0.015955      A
    1909         -0.001354  0.001545  0.000007 -0.008195      B"""),
                 header=1, skiprows=0).set_index(['Index'])

df_wgt = pd.read_fwf(StringIO(u"""
     Category     var_1      var_2     var_3     var_4
        A       0.182022   0.182022  0.131243  0.182022
        B       0.534814   0.534814  0.534814  0.534814
        C       0.131243   0.534814  0.131243  0.182022
        D       0.182022   0.151921  0.151921  0.131243"""),
                 header=1, skiprows=0).set_index(['Category'])

df['dot'] = df[df_wgt.columns].dot(df_wgt.T).lookup(df.index, df.Category)
print(df)

Results:

          var_1     var_2     var_3     var_4 Category       dot
Index                                                           
1903   0.000443  0.006928  0.000000  0.012375        A  0.003594
1904  -0.000690 -0.007873  0.000171  0.014824        A  0.001162
1905  -0.001354  0.001545  0.000007 -0.008195        C -0.000842
1906  -0.001578  0.008796 -0.000164  0.015955        D  0.003118
1907  -0.001578  0.008796 -0.000164  0.015955        A  0.004196
1909  -0.001354  0.001545  0.000007 -0.008195        B -0.004277
Stephen Rauch
  • Thanks. If you could explain why the second solution is faster, that would be great; presumably because it is all vectorised? Good to have two solutions to learn from too. – Carl May 13 '17 at 14:54
  • Both solutions are vectorized, but the second solution presented needs to do a groupby on all of the data, then run a `dot()` across the group. This is basically the solution you suggested, but were not sure how to implement. While the first solution does the dot products for *all* possible categories, then only selects the result for the applicable category. The first does a lot of extra math, but does not need to create the groups by category. I would expect *dot() and then select* to be faster. – Stephen Rauch May 13 '17 at 15:30
  • I also found that I had to set the index of df_wgt to be Category in the case of my data frame. Just mentioning this for clarity, because the wgt_df in my question above has no index column, which implies Category is the index. I don't fully understand why this must be the case, however. In fact, presumably the Category column doesn't even need to be present in wgt_df, because the dot product is performed between every combination of weights and df, and then the desired subset is selected using .lookup()? – Carl May 15 '17 at 16:45
  • In this case setting category as the index of the weights *gets it out of the way*. That is, when looking at the *values* of the dataframe, the index is not considered, and thus category won't be considered one of the weights. – Stephen Rauch May 15 '17 at 16:52
  • Also the category is necessary on the weights for the lookup to work. To help your understanding, I suggest you examine the results of `df[df_wgt.columns].dot(df_wgt.T)`, which is the result prior to the lookup. – Stephen Rauch May 15 '17 at 17:12
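
Following up on the timing discussion in the comments above, a rough sketch of how one might compare the two approaches on a larger frame (the frame size, seed and helper names are illustrative, assuming df_wgt indexed by Category as in the test code):

import timeit
import numpy as np
import pandas as pd

# Build a larger random frame so the timing is meaningful (sizes are arbitrary).
n = 1_000_000
rng = np.random.default_rng(0)
big = pd.DataFrame(rng.normal(size=(n, 4)), columns=df_wgt.columns)
big['Category'] = rng.choice(list(df_wgt.index), size=n)

def dot_then_select():
    # Dot against all categories, then select the matching column per row.
    all_cats = big[df_wgt.columns].dot(df_wgt.T)
    cols = df_wgt.index.get_indexer(big['Category'])
    return all_cats.to_numpy()[np.arange(n), cols]

def group_then_dot():
    # Dot each category's block against its own weight vector, then recombine.
    parts = [pd.Series(g[df_wgt.columns].to_numpy().dot(df_wgt.loc[cat].to_numpy()),
                       index=g.index)
             for cat, g in big.groupby('Category')]
    return pd.concat(parts).reindex(big.index)

print(timeit.timeit(dot_then_select, number=3))
print(timeit.timeit(group_then_dot, number=3))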