Duplicate rows, according to value in a column

Question

I want to read an Excel File and want my code to duplicate rows according to the value in one of the columns in that excel file.

e.g.

    Col 1  Col 2
0   Adam      3
1  Sarah      2
2   John      0

I want my code read above file and copy Adam's row 3 times and Sarah's row 2 times and export in a new file.

jezrael · Accepted Answer · 2018-06-11T14:06:31.473

7

Use repeat with loc if detault RangeIndex:

print (df.index.repeat(df['Col 2']))
Int64Index([0, 0, 0, 1, 1], dtype='int64')

df = df.loc[df.index.repeat(df['Col 2'])].reset_index(drop=True)
print (df)
   Col 1  Col 2
0   Adam      3
1   Adam      3
2   Adam      3
3  Sarah      2
4  Sarah      2

And then:

df.to_csv(file, index=False)

General solution for duplicated Index or DatetimeIndex is repeat numpy array created by numpy.arange and selecting by positions by iloc:

df = df.iloc[np.arange(len(df)).repeat(df['Col 2'])].reset_index(drop=True)

EDIT:

Solution without np.repeat:

df =df.loc[[c for a, b in zip(df.index, df['Col 2']) for c in [a] * b]].reset_index(drop=True)
print (df)
   Col 1  Col 2
0   Adam      3
1   Adam      3
2   Adam      3
3  Sarah      2
4  Sarah      2

edited Jun 11 '18 at 14:06

answered Jun 11 '18 at 11:07

jezrael

822,522
95
1,334
1,252

Thank you Jezrael for your help but i am getting a TypeError - TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe' – M ob Jun 11 '18 at 11:46
Try `df.loc[df.index.repeat(df['Col 2']).astype(np.int64)].reset_index(drop=True)` – jezrael Jun 11 '18 at 11:47
import pandas as pd import numpy as np here is the code i am using but still getting Type Error. readFile = pd.read_csv('test.csv') df = pd.DataFrame(readFile) x = df.loc[df.index.repeat(df['Col 2']).astype(np.int64)].reset_index(drop=True) print(x) – M ob Jun 11 '18 at 12:32
@Mob - It looks like pandas verxion issue, what is your version of pandas? – jezrael Jun 11 '18 at 12:33
Hi Jezrael, Thank you for all your help - pandas: 0.23.0 – M ob Jun 11 '18 at 13:03
@Mob - It is bad, I try google solution and it seems numpy bug https://github.com/bytefish/facerec/issues/57 :( – jezrael Jun 11 '18 at 13:04
Thank you, i really appreciate your help. can you suggest a different way of repeating rows? – M ob Jun 11 '18 at 13:30
@Mob - Hard question, how working `df = df.iloc[np.repeat(np.arange(len(df), dtype=np.int64), df['Col 2'].values.astype(np.int64))].reset_index(drop=True)` ? – jezrael Jun 11 '18 at 13:47
1

@Mob - added solution without `np.repeat` – jezrael Jun 11 '18 at 14:06
Thank you jezrael this one works. Thank you very much, i really appreciate your help. Thank you – M ob Jun 12 '18 at 14:24

score 0 · Answer 2 · answered Jun 11 '18 at 11:17

0

another way to solve this,

pd.merge(pd.DataFrame(df['Col 1'].repeat(df['Col 2'])),df,on=['Col 1'])

answered Jun 11 '18 at 11:17

Mohamed Thasin ah

10,754
11
52
111

While this code snippet may solve the question, [including an explanation](http://meta.stackexchange.com/questions/114762/explaining-entirely-code-based-answers) really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. – 31piy Jun 11 '18 at 11:38

Duplicate rows, according to value in a column

2 Answers2

Linked