5

I am looking for the most efficient solution to replicate dataframe rows. Each row should be replicated x times, where x is unique for each row.

Let's say this is my given dataframe:

| id | count |
|----|-------|
| a  | 1     |
| b  | 2     |
| c  | 5     |

The dataframe as a result should look like this, where each row was replicated by the amount given in column "count":

| id | count |
|----|-------|
| a  | 1     |
| b  | 2     |
| b  | 2     |
| c  | 5     |
| c  | 5     |
| c  | 5     |
| c  | 5     |
| c  | 5     |

A very basic approach would be looping over the dataframe and appending the row x times like this:

data = {'id': ['a', 'b', 'c'], 'count': [1, 2, 5]}
df = pd.DataFrame(data=data)

for index, row in df.iterrows():
    for x in range(row['count']-1):
        df = df.append(pd.Series(row, index=df.columns), ignore_index=True)

df = df.sort_values(by=['id'])
df = df.reset_index(drop=True)

df

While this works for small data frames, it is not very efficient for large dataframes with thousands of rows. Since each line has to be replicated up to 200 times, the final dataframe can contain millions of lines.

Already read about pandas/numpy vectorization, but unfortunately I have no idea if (and how) it could help in this case where I have to add a lot of rows to the dataframe.

Any suggestions how to improve the performance?

Dominik Braun
  • 191
  • 1
  • 1
  • 5

3 Answers3

5

Use Index.repeat if unique index values and then pass to DataFrame.loc:

df1 = df.loc[df.index.repeat(df['count'])].reset_index(drop=True)
print (df1)
  id  count
0  a      1
1  b      2
2  b      2
3  c      5
4  c      5
5  c      5
6  c      5
7  c      5

If possible some duplicates in index values is possible use numpy.repeat and DataFrame.iloc:

print (df)
  id  count
0  a      1
1  b      2
1  c      5

df1 = df.iloc[np.repeat(np.arange(len(df.index)), df['count'])].reset_index(drop=True)
print (df1)
  id  count
0  a      1
1  b      2
2  b      2
3  c      5
4  c      5
5  c      5
6  c      5
7  c      5
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

You could reindex with the count column :

df.reindex(df.index.repeat(df["count"])).reset_index(drop=True)
sammywemmy
  • 27,093
  • 4
  • 17
  • 31
0
In [1]: import numpy as np 
   ...: import pandas as pd                                                                         

In [2]: data = {'id':list(map(chr, range(97, 123))), 'count': pd.Series(np.random.randint(0,500,size
   ...: =26))}                                                                                      

In [3]: df = pd.DataFrame(data)                                                                     

In [4]: df.head()                                                                                   
Out[4]: 
  id  count
0  a    145
1  b    297
2  c     46
3  d    493
4  e     46

In [5]: df_replicate = pd.DataFrame(np.repeat(df.values, df['count'], axis=0),columns=df.columns)   

In [6]: df_replicate.head()                                                                         
Out[6]: 
  id count
0  a   145
1  a   145
2  a   145
3  a   145
4  a   145
Darsh Shukla
  • 289
  • 4
  • 5