
I want to sample rows from each column of a dataframe according to a dataframe of weights. All columns of the dataframe of weights sum to 1.

A=pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]]).transpose()
w=pd.DataFrame([[0.2,0.5,0.3],[0.1,0.3,0.6],[0.4,0.5,0.1]])
sampled_data = A.sample(n=10, replace=True, weights=w)

But this code yields the following error:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Obviously I would like the first column of A sampled according to the weights from the first column of w and so on.

The result should look like this:

sampled_data =
  1 2 3
0 2 6 8
1 2 5 7
2 3 4 8
. .....
9 1 6 9
P_Sta
  • `weights` is undefined is that supposed to be `weights=w`? – Henry Ecker Jun 24 '21 at 00:16
  • Yes, I just edited the question, thank you. – P_Sta Jun 24 '21 at 00:18
  • Then I'm unclear on the logic. `sample` is expecting a (1D) list of weights to determine how to choose the `row`. But it appears each cell has its own weight. What do the weights mean? – Henry Ecker Jun 24 '21 at 00:22
  • I want each column of A sampled according to the column of the same index in w – P_Sta Jun 24 '21 at 00:23
  • Given my understanding, it appears that each row of the first dataframe corresponds to a column value in the new dataframe. Which is not a sample operation as you're not sampling rows from `A` but rather generating random values based on weights. This question appears more like how do I randomly generate `m` rows and `n` columns based on `n` value lists and `n` weight lists. And as far as a reasonable way of doing this, I'm unsure. – Henry Ecker Jun 24 '21 at 02:34
  • @HenryEcker I edited the question and added a transpose to A. – P_Sta Jun 24 '21 at 10:43

1 Answer


It sounds like you want independent samples from each column. If so, I think this does what you want:

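import pandas as pd

# Transposed so that column i of A holds the values and column i of w holds their weights.
A = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]]).transpose()
w = pd.DataFrame([[0.2, 0.5, 0.3], [0.1, 0.3, 0.6], [0.4, 0.5, 0.1]]).transpose()

samples = []
for col in A.columns:
    # Draw 10 values from this column, with replacement, using the matching weight column.
    s = A[col].sample(n=10, replace=True, weights=w[col])
    samples.append(s.values)

# Each sampled array becomes one column of the result.
A_sample = pd.DataFrame(samples).transpose()
print(A_sample)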

The output is

   0  1  2
0  3  6  7
1  2  5  8
2  3  6  8
3  1  6  7
4  1  5  8
5  3  6  8
6  1  6  9
7  1  6  7
8  2  4  8
9  2  6  7

Note that to make this work, I made A and w be the transposes of what you originally had.

There's probably a slicker way to do this, but I don't know it.
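
For what it's worth, a somewhat more compact variant is sketched below. It still works one column at a time, but it swaps `Series.sample` for NumPy's `Generator.choice` and builds the result with a dict comprehension. The seed and the name `rng` are only there for illustration and are not part of the original code.

```python
import numpy as np
import pandas as pd

# Same frames as above: column i of A is sampled with the weights in column i of w.
A = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]]).transpose()
w = pd.DataFrame([[0.2, 0.5, 0.3], [0.1, 0.3, 0.6], [0.4, 0.5, 0.1]]).transpose()

rng = np.random.default_rng(0)  # seed is an arbitrary choice, used only for reproducibility

# One independent weighted draw of 10 values per column.
A_sample = pd.DataFrame({
    col: rng.choice(A[col].to_numpy(), size=10, replace=True, p=w[col].to_numpy())
    for col in A.columns
})
print(A_sample)
```

This is still a loop over columns, so it is no more vectorized than the version above; it mainly avoids the intermediate list and the final transpose.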

  • Thank you for your answer. Indeed, I could loop over each column, but I was looking for a vectorized solution. Since the pandas documentation for `sample` indicates that the `weights` parameter can be an ndarray, I thought what I'm trying to do could be achieved directly. – P_Sta Jun 24 '21 at 10:03
  • I'm pretty sure there's no direct way to do this, but people with better pandas knowledge might disagree. The `DataFrame.sample` function is designed to sample items along an axis, meaning it really wants to grab entire rows or entire columns rather than grab individual cells to assemble a new row that wasn't in the original frame. BTW, if speed is the concern, the for loop should be relatively easy to parallelize (but in that case, numpy arrays will probably be faster than DataFrames anyway). – Jordan Rozum Jun 24 '21 at 14:42
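
Following up on the NumPy suggestion, here is a rough sketch of a loop-free version. It does not use `DataFrame.sample` at all; instead it applies inverse-CDF sampling to every column at once. The helper name `sample_columns` and the seed are hypothetical, and the sketch assumes `values` and `weights` have the same shape, with column i of `weights` holding non-negative weights for column i of `values`.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]]).transpose()
w = pd.DataFrame([[0.2, 0.5, 0.3], [0.1, 0.3, 0.6], [0.4, 0.5, 0.1]]).transpose()


def sample_columns(values, weights, n, rng):
    """Hypothetical helper: draw n weighted samples independently from each column."""
    vals = values.to_numpy()                       # shape (k, m)
    cdf = np.cumsum(weights.to_numpy(), axis=0)    # running weight totals per column
    cdf = cdf / cdf[-1]                            # normalise so each column ends at exactly 1
    u = rng.random((n, 1, vals.shape[1]))          # uniform draws in [0, 1), shape (n, 1, m)
    # Row index of each draw = number of CDF entries that are <= the draw.
    idx = (u >= cdf[None, :, :]).sum(axis=1)       # shape (n, m), entries in 0..k-1
    picked = np.take_along_axis(vals, idx, axis=0) # picked[i, j] = vals[idx[i, j], j]
    return pd.DataFrame(picked, columns=values.columns)


rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
A_sample = sample_columns(A, w, 10, rng)
print(A_sample)
```

The comparison builds an intermediate boolean array of shape (n, k, m), so this trades memory for the absence of a Python loop; for many rows per column, the per-column loop may be the better fit.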