I need to create a DataFrame of random samples in which rows that share the same index value get the same sample (duplicated, not sampled again).

nums = [0.1,0.3,0.7,0.5]

country  a   b   c   d 
USA     0.3 0.1 0.5 0.7
USA     0.3 0.1 0.5 0.7
Italy   0.1 0.5 0.7 0.3
UK      0.7 0.1 0.5 0.3
Uk      0.7 0.1 0.5 0.3
UK      0.7 0.1 0.5 0.3

I tried:

for i in df.index:
    # random.sample needs k; a fresh sample is still drawn on every iteration
    df.loc[i] = random.sample(nums, k=len(nums))

but each row got a different sample

ari6739

2 Answers

0

You should use the groupby function. Try this:

df.groupby(df.index).first()
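
As a rough illustration of what that call does, here is a minimal sketch on a made-up frame (the data and the names `collapsed` and `expanded` are assumptions, not from the question). It collapses rows that share an index label into one row each; the final `reindex` step is an addition here, used to repeat the collapsed rows back onto the original labels.

```python
import pandas as pd

# Hypothetical frame with duplicated index labels (illustration only)
df = pd.DataFrame(
    {"a": [0.3, 0.3, 0.1], "b": [0.1, 0.1, 0.5]},
    index=pd.Index(["USA", "USA", "Italy"], name="country"),
)

# One row per distinct index label; first() keeps the first non-null value per column
collapsed = df.groupby(df.index).first()

# Repeat the collapsed rows so every original label is present again
expanded = collapsed.reindex(df.index)

print(collapsed)
print(expanded)
```

Note that this only deduplicates existing values; it does not draw any samples by itself.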

0

You can get the number of rows per country, generate one sample for each country, and then use the reindex-repeat strategy (this answer) to expand the samples back to one row per original row.

# get size
df_out = df.groupby("country").size().to_frame("size")

# generate sample for each country
df_out[["a","b","c","d"]] = df_out.apply(lambda el: random.sample(nums, k=4), axis=1).to_list()

# ***** Workaround: if the line above fails (legacy pandas version?), *****
# ***** replace it with the following code instead of running both *****
df_out = pd.concat(
    [df_out,
     pd.DataFrame([random.sample(nums, k=4) for i in range(len(df_out))],
                  columns=["a","b","c","d"],
                  index=df_out.index)
     ], axis=1)

# repeat rows based on size
df_out = df_out.reindex(df_out.index.repeat(df_out["size"])).reset_index()

Output

print(df_out)
  country  size    a    b    c    d
0   Italy     1  0.5  0.3  0.1  0.7
1      UK     2  0.3  0.5  0.7  0.3
2      UK     2  0.3  0.5  0.7  0.3
3     USA     2  0.7  0.1  0.3  0.1
4     USA     2  0.7  0.1  0.3  0.1
5      Uk     1  0.1  0.7  0.5  0.5   <- caused by typo in given data

Tested on Python 3.7, pandas 1.1.3, 64-bit Debian 10.
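
For comparison, a self-contained sketch of the same idea (one sample per distinct country, shared by all of its rows) might look like the following. It swaps the reindex-repeat step for `DataFrame.join`, which keeps the original row order; the input frame and the names `cols`, `samples`, and `result` are assumptions made for illustration.

```python
import random

import pandas as pd

nums = [0.1, 0.3, 0.7, 0.5]
cols = ["a", "b", "c", "d"]

# Hypothetical input: only the country labels matter here
df = pd.DataFrame({"country": ["USA", "USA", "Italy", "UK", "UK"]})

# Draw one shuffled copy of nums per distinct country
countries = df["country"].unique()
samples = pd.DataFrame(
    [random.sample(nums, k=len(nums)) for _ in range(len(countries))],
    index=pd.Index(countries, name="country"),
    columns=cols,
)

# Attach each country's sample to every row of that country
result = df.join(samples, on="country")
print(result)
```

Because the samples are keyed by country, label variants such as "UK" and "Uk" would still be treated as different countries.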

Bill Huang
  • `df_out[["a","b","c","d"]] = df_out.apply(lambda el: random.sample(nums, k=4), axis=1)` gives: None of [Index(["a","b","c","d"], dtype='object')] are in the [columns] – ari6739 Nov 02 '20 at 20:03
  • Please update pandas to its latest version. If this is not possible or didn't work, use the workaround instead (edited). – Bill Huang Nov 02 '20 at 20:28
  • I got "Must have equal len keys and value when setting with an iterable" – ari6739 Nov 02 '20 at 21:06
  • for `df_out[["a"]] = df_out.apply(lambda el: random.sample(nums, k=4)` I get the list of values at "a" – ari6739 Nov 02 '20 at 21:08
  • Could you please provide the COMPLETE steps (including the sample dataset, in a directly reproducible form, not as printed results) in your post? Also state which versions of Python and pandas you are using. – Bill Huang Nov 02 '20 at 22:05
  • `df_out = df.groupby("country").size().to_frame("size") df_out[["vals"]] = df_out.apply(lambda el: random.sample(nums, len(nums)), axis=1) df_out = df_out.reindex(df_out.index.repeat(df_out["size"])).reset_index()` – ari6739 Nov 03 '20 at 07:29
  • `df2 = pd.DataFrame(index = df.index) df2[list(df.columns)] = df_out.vals.tolist()` – ari6739 Nov 03 '20 at 07:31
  • I still cannot reproduce your error, but good to see that you've found a workaround! – Bill Huang Nov 03 '20 at 07:59
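
The workaround from the last comments, which stores each country's sample as a single list-valued cell and expands it afterwards, can be sketched roughly as below. The input frame, the explicit `pd.Series` construction, and the names `expanded` and `result` are assumptions added to make the snippet self-contained; they are not the commenter's exact code.

```python
import random

import pandas as pd

nums = [0.1, 0.3, 0.7, 0.5]

# Hypothetical input, already sorted by country so groupby order matches row order
df = pd.DataFrame({"country": ["Italy", "UK", "UK", "USA", "USA"]})

# Count rows per country
df_out = df.groupby("country").size().to_frame("size")

# Keep each country's sample as one list-valued cell in a "vals" column
df_out["vals"] = pd.Series(
    [random.sample(nums, len(nums)) for _ in range(len(df_out))],
    index=df_out.index,
)

# Repeat each country's row according to its size
df_out = df_out.reindex(df_out.index.repeat(df_out["size"])).reset_index()

# Expand the list column into separate columns
expanded = pd.DataFrame(df_out["vals"].tolist(), columns=["a", "b", "c", "d"])
result = pd.concat([df_out[["country"]], expanded], axis=1)
print(result)
```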