2

Hopefully, this is not a duplicate. I have two data frames: The first data frame has size n x m, and each cell contains a list of numeric values of size k. The second data frame has size n x k, and each cell contains a single numeric value (essentially a weight). What I need to do is find an efficient way to do the following: For each cell in the first data frame, multiply the list of values in that cell by the series of values in the second data frame that corresponds to the same row. Then, sum all of the products together.

Here is a simple example: df1 has size 2x3, and each cell contains a list of 3 values. df2 has size 2x3, and each cell contains a single value.

In [3]: df1
Out[3]:
           x          y          z
0  [1, 2, 3]  [2, 3, 4]  [3, 4, 5]
1  [4, 5, 6]  [5, 6, 7]  [6, 7, 8]

In [5]: df2
Out[5]:
   first  second  third
0      1       1      1
1      2       2      2

The intermediate result after multiplying the lists in df1 by the corresponding row series from df2 is:

           x          y          z
0  [1, 2, 3]  [2, 3, 4]  [3, 4, 5]
1  [8, 10, 12]  [10, 12, 14]  [12, 14, 16]

And the final result should be:

           x          y          z
0          6          9         12
1         30         36         42

Right now I'm just using a nested for loop, which works but is incredibly inefficient (of course). I tried using itertuples(), but I couldn't get that to work properly. Any help here is much appreciated!

rhozzy
  • 332
  • 1
  • 9
  • Don't store your data as lists. In this case, your data is 3-D, you can do 2-level index `(0,x), ...` and single-level columns `0,1,2`. Then it would be straight-forward to multiply and add. – Quang Hoang Dec 12 '21 at 20:37
  • Hmm, that's an interesting approach. Do you know if there's a simple way to translate the lists into the "third dimension" in Pandas? – rhozzy Dec 12 '21 at 20:43
  • `df.stack().explode()`. – Quang Hoang Dec 12 '21 at 20:48
  • What is a "list of numeric values"? A python list or a numpy array? – Corralien Dec 12 '21 at 20:53
  • 1
    @QuangHoang, using your approach I would do the following: `(df1.stack().explode()*df2.stack()).reset_index().pivot_table(values=0,index='level_0',columns='level_1',aggfunc=sum).reset_index(drop=True).rename_axis(None, axis=1)` It does work and I'm sure it is more efficient than nested for loops. But, I think the accepted answer from @Corralien is even better. Thank you for the help, though! – rhozzy Dec 12 '21 at 21:05

1 Answers1

2

Try:

# Convert each list to numpy array if it's not already the case
df1 = df1.applymap(np.array)

vals = np.sum((df1.values * df2.values), axis=1)
out = pd.DataFrame.from_records(vals, index=df1.index, columns=df1.columns)

Output:

>>> out
    x   y   z
0   6   9  12
1  30  36  42

# Intermediate result
>>> df1.values * df2.values
 [[array([1, 2, 3]) array([2, 3, 4]) array([3, 4, 5])]
 [array([ 8, 10, 12]) array([10, 12, 14]) array([12, 14, 16])]]
Corralien
  • 109,409
  • 8
  • 28
  • 52