I have two dataframes: both have 5 columns, but the first has 100 rows and the second has just one row. I need to multiply every row of the first dataframe by the single row of the second, then sum the resulting values across the columns of each row and store that value in a new 6th column, 'sum of multiplications'. I've seen the np.dot operation, but I'm not sure whether it can be applied to dataframes. I'm also looking for a pythonic/pandas operation or method, if one exists, that could replace somewhat heavy NumPy code written from scratch. Thank you in advance for your advice.

Amanda

2 Answers

I think you can convert the DataFrames to NumPy arrays with .values, multiply them, and finally sum:

import pandas as pd
import numpy as np

np.random.seed(1)
df1 = pd.DataFrame(np.random.randint(10, size=(1,5)))
df1.columns = list('ABCDE')
print(df1)
   A  B  C  D  E
0  5  8  9  5  0

np.random.seed(0)
df2 = pd.DataFrame(np.random.randint(10,size=(10,5)))
df2.columns = list('ABCDE')
print(df2)
   A  B  C  D  E
0  5  0  3  3  7
1  9  3  5  2  4
2  7  6  8  8  1
3  6  7  7  8  1
4  5  9  8  9  4
5  3  0  3  5  0
6  2  3  8  1  3
7  3  3  7  0  1
8  9  9  0  4  7
9  3  2  7  2  0
print(df2.values * df1.values)
[[25  0 27 15  0]
 [45 24 45 10  0]
 [35 48 72 40  0]
 [30 56 63 40  0]
 [25 72 72 45  0]
 [15  0 27 25  0]
 [10 24 72  5  0]
 [15 24 63  0  0]
 [45 72  0 20  0]
 [15 16 63 10  0]]

df = pd.DataFrame(df2.values * df1.values)
df['sum'] = df.sum(axis=1)
print(df)
    0   1   2   3  4  sum
0  25   0  27  15  0   67
1  45  24  45  10  0  124
2  35  48  72  40  0  195
3  30  56  63  40  0  189
4  25  72  72  45  0  214
5  15   0  27  25  0   67
6  10  24  72   5  0  111
7  15  24  63   0  0  102
8  45  72   0  20  0  137
9  15  16  63  10  0  104
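
Since the question asks about np.dot: the 'sum' column above is a matrix-vector product, so it could also be sketched with DataFrame.dot. This is only a sketch, assuming both frames share the same column labels; the values are copied from the first three rows of the printed frames above rather than regenerated from the seeds, and the names weights and row_sums are mine.

```python
import pandas as pd

# Values copied from the printed frames above (first three rows of df2 only).
df1 = pd.DataFrame([[5, 8, 9, 5, 0]], columns=list('ABCDE'))
df2 = pd.DataFrame(
    [[5, 0, 3, 3, 7],
     [9, 3, 5, 2, 4],
     [7, 6, 8, 8, 1]],
    columns=list('ABCDE'),
)

# The single row as a Series indexed by column label.
weights = df1.iloc[0]

# DataFrame.dot matches df2's columns against the Series index and
# returns one value per row: the sum of the element-wise products.
row_sums = df2.dot(weights)
print(row_sums.tolist())
```

This skips the intermediate frame of products, so it fits when only the sums are needed.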

Timing:

In [1185]: %timeit df2.mul(df1.ix[0], axis=1)
The slowest run took 5.07 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 287 µs per loop

In [1186]: %timeit pd.DataFrame(df2.values * df1.values)
The slowest run took 6.31 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 98 µs per loop
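
The first timed expression is the pandas-only way to do the multiplication. A rough sketch of it follows, with .iloc in place of .ix (which was removed in pandas 1.0) and with values copied from the first two rows of the printed frames above. The names via_pandas and via_numpy are mine, for illustration only.

```python
import numpy as np
import pandas as pd

# Values copied from the printed frames above (first two rows of df2 only).
df1 = pd.DataFrame([[5, 8, 9, 5, 0]], columns=list('ABCDE'))
df2 = pd.DataFrame([[5, 0, 3, 3, 7], [9, 3, 5, 2, 4]], columns=list('ABCDE'))

# pandas route: multiply each row of df2 by the single row, aligned on column labels.
via_pandas = df2.mul(df1.iloc[0], axis=1)

# NumPy route: broadcast the (1, 5) array across the (2, 5) array.
via_numpy = pd.DataFrame(df2.values * df1.values, columns=df2.columns)

# Both routes give the same products.
assert np.array_equal(via_pandas.values, via_numpy.values)

# Add the row-wise sum as a new column.
via_pandas['sum'] = via_pandas.sum(axis=1)
print(via_pandas)
```

The mul version keeps the column labels and aligns on them, which is probably part of why it is slower than the raw array product in the timing above.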
jezrael

You are probably looking for something like this:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({ 'A' : [1.1,2.7, 3.4], 
                     'B' : [-1.,-2.5, -3.9]})

df1['sum of multipliations']=df1.sum(axis = 1)


df2 = pd.DataFrame({ 'A' : [2.], 
                     'B' : [3.], 
                     'sum of multipliations' : [1.]})

print(df1)
print(df2)

# .ix was removed from pandas; .iloc[0] selects the first row by position
row = df2.iloc[0]
df5 = df1.mul(row, axis=1)
df5.loc['Total'] = df5.sum()
print(df5)
tfv