duplicate row values with the same row index

Question

I need to create a data frame of samples in which rows that share the same row index get the same sample (duplicated, not sampled again).

nums = [0.1,0.3,0.7,0.5]

country  a   b   c   d 
USA     0.3 0.1 0.5 0.7
USA     0.3 0.1 0.5 0.7
Italy   0.1 0.5 0.7 0.3
UK      0.7 0.1 0.5 0.3
Uk      0.7 0.1 0.5 0.3
UK      0.7 0.1 0.5 0.3

I tried:

import random

for i in df.index:
    # random.sample needs the sample size as its second argument
    df.loc[i] = random.sample(nums, len(nums))

but each row got a different sample.

Cristian Contrera · Answer 1 · 2020-11-02T19:53:09.797

You should use the groupby function. Try this:

df.groupby(df.index).first()

edited Nov 02 '20 at 19:53

answered Nov 02 '20 at 19:34

Cristian Contrera

I got all NaN values. and my columns moved to indexes – ari6739 Nov 02 '20 at 19:42
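
A likely explanation for the NaN result: `first()` only returns the first non-null value of each group, so on a frame that holds no values yet it has nothing to return, and it never draws a sample. A minimal sketch of the sampling step, assuming the country labels are the row index (as the question title suggests); the index contents and the names `cols`, `samples` and `filled` are illustrative, not taken from the question:

```python
import random

import pandas as pd

nums = [0.1, 0.3, 0.7, 0.5]
cols = ["a", "b", "c", "d"]

# Assumed layout: country labels are the row index (typo "Uk" normalized to "UK").
index = pd.Index(["USA", "USA", "Italy", "UK", "UK", "UK"], name="country")

# Draw one shuffled copy of nums per distinct label.
samples = {label: random.sample(nums, k=len(nums)) for label in index.unique()}

# Look up the stored sample for every row, so equal labels get equal rows
# and the original row order is preserved.
filled = pd.DataFrame([samples[label] for label in index], index=index, columns=cols)

print(filled)
```

Because each label is sampled once and then looked up, repeated labels cannot receive different samples.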

Bill Huang · Accepted Answer · 2020-11-03T07:56:00.167

You can count the rows per country, generate one sample for each country, and then use the reindex-repeat strategy (this answer) to expand each sample to as many rows as that country has.

import random
import pandas as pd

# get the number of rows per country
df_out = df.groupby("country").size().to_frame("size")

# generate one sample for each country
df_out[["a","b","c","d"]] = df_out.apply(lambda el: random.sample(nums, k=4), axis=1).to_list()

# ***** If the line above does not work (legacy pandas version?), *****
# ***** replace it with the following code (do not run both) *****
df_out = pd.concat(
    [df_out,
     pd.DataFrame([random.sample(nums, k=4) for i in range(len(df_out))],
                  columns=["a","b","c","d"],
                  index=df_out.index)
     ], axis=1)

# repeat rows based on size
df_out = df_out.reindex(df_out.index.repeat(df_out["size"])).reset_index()

Output

print(df_out)
  country  size    a    b    c    d
0   Italy     1  0.5  0.3  0.1  0.7
1      UK     2  0.3  0.5  0.7  0.3
2      UK     2  0.3  0.5  0.7  0.3
3     USA     2  0.7  0.1  0.3  0.1
4     USA     2  0.7  0.1  0.3  0.1
5      Uk     1  0.1  0.7  0.5  0.5   <- caused by typo in given data

Tested on Python 3.7, pandas 1.1.3, 64-bit Debian 10.
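
For a directly reproducible run, the steps above can be sketched end to end as follows. The input frame is an assumption reconstructed from the question (a single "country" column, with the "Uk" typo normalized to "UK"), and the `cols` and `samples` names are illustrative. It uses the `pd.concat` variant, which does not rely on assigning several new columns at once:

```python
import random

import pandas as pd

nums = [0.1, 0.3, 0.7, 0.5]
cols = ["a", "b", "c", "d"]

# Assumed input: one country label per row, reconstructed from the question.
df = pd.DataFrame({"country": ["USA", "USA", "Italy", "UK", "UK", "UK"]})

# Step 1: count the rows per country (the result is indexed by country).
df_out = df.groupby("country").size().to_frame("size")

# Step 2: draw one shuffled copy of nums per country.
samples = pd.DataFrame(
    [random.sample(nums, k=len(nums)) for _ in range(len(df_out))],
    columns=cols,
    index=df_out.index,
)
df_out = pd.concat([df_out, samples], axis=1)

# Step 3: repeat each country's row "size" times, then restore the column.
df_out = df_out.reindex(df_out.index.repeat(df_out["size"])).reset_index()

print(df_out)
```

Note that `groupby` sorts the countries, so the rows come back grouped alphabetically rather than in their original order.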

edited Nov 03 '20 at 07:56

answered Nov 02 '20 at 19:50

Bill Huang

`df_out[["a","b","c","d"]] = df_out.apply(lambda el: random.sample(nums, k=4), axis=1)` raises: None of [Index(["a","b","c","d"], dtype='object')] are in the [columns] – ari6739 Nov 02 '20 at 20:03
Please update pandas to its latest version. If this is not possible or didn't work, use the workaround instead (edited). – Bill Huang Nov 02 '20 at 20:28
I got "Must have equal len keys and value when setting with an iterable" – ari6739 Nov 02 '20 at 21:06
For `df_out[["a"]] = df_out.apply(lambda el: random.sample(nums, k=4)` I get the whole list of values in column "a". – ari6739 Nov 02 '20 at 21:08
Could you please provide the COMPLETE steps in your post (including the sample dataset, in a directly reproducible form, not as printed results)? Please also state which versions of Python and pandas you are using. – Bill Huang Nov 02 '20 at 22:05
`df_out = df.groupby("country").size().to_frame("size"); df_out[["vals"]] = df_out.apply(lambda el: random.sample(nums, len(nums)), axis=1); df_out = df_out.reindex(df_out.index.repeat(df_out["size"])).reset_index()` – ari6739 Nov 03 '20 at 07:29
`df2 = pd.DataFrame(index = df.index); df2[list(df.columns)] = df_out.vals.tolist()` – ari6739 Nov 03 '20 at 07:31
I still cannot reproduce your error, but good to see that you've found a workaround! – Bill Huang Nov 03 '20 at 07:59
