I am looking for the most efficient solution to replicate dataframe rows. Each row should be replicated x times, where x differs from row to row.
Let's say this is my given dataframe:
| id | count |
|----|-------|
| a | 1 |
| b | 2 |
| c | 5 |
The resulting dataframe should look like this, with each row replicated the number of times given in the "count" column:
| id | count |
|----|-------|
| a | 1 |
| b | 2 |
| b | 2 |
| c | 5 |
| c | 5 |
| c | 5 |
| c | 5 |
| c | 5 |
A very basic approach would be to loop over the dataframe and append each row another (count - 1) times, like this:

```python
import pandas as pd

data = {'id': ['a', 'b', 'c'], 'count': [1, 2, 5]}
df = pd.DataFrame(data=data)

# Append each row (count - 1) more times, one row per iteration
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used here)
for index, row in df.iterrows():
    for x in range(row['count'] - 1):
        df = pd.concat([df, row.to_frame().T], ignore_index=True)

df = df.sort_values(by=['id'])
df = df.reset_index(drop=True)
df
```
While this works for small dataframes, it is not very efficient for large ones with thousands of rows. Since each row can be replicated up to 200 times, the final dataframe can contain millions of rows.
I have already read about pandas/numpy vectorization, but I am not sure how (or whether) it could help in a case like this, where a lot of rows have to be added to the dataframe.
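For illustration, here is a minimal sketch of what I imagine a vectorized version might look like, built on pandas' `Index.repeat` (repeating each index label `count` times and then selecting those rows in one operation); this is an assumption on my part, not a verified solution:

```python
import pandas as pd

data = {'id': ['a', 'b', 'c'], 'count': [1, 2, 5]}
df = pd.DataFrame(data=data)

# Sketch: repeat each index label 'count' times, then select all the
# repeated rows at once; unverified for performance on large dataframes
result = df.loc[df.index.repeat(df['count'])].reset_index(drop=True)
print(result)
```

This produces the expected output on the small example above, but I do not know whether it is the idiomatic approach or how it scales to millions of rows.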
Any suggestions on how to improve the performance?