2

I want to read an Excel File and want my code to duplicate rows according to the value in one of the columns in that excel file.

e.g.

    Col 1  Col 2
0   Adam      3
1  Sarah      2
2   John      0

I want my code read above file and copy Adam's row 3 times and Sarah's row 2 times and export in a new file.

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
M ob
  • 97
  • 1
  • 8

2 Answers2

7

Use repeat with loc if detault RangeIndex:

print (df.index.repeat(df['Col 2']))
Int64Index([0, 0, 0, 1, 1], dtype='int64')

df = df.loc[df.index.repeat(df['Col 2'])].reset_index(drop=True)
print (df)
   Col 1  Col 2
0   Adam      3
1   Adam      3
2   Adam      3
3  Sarah      2
4  Sarah      2

And then:

df.to_csv(file, index=False)

General solution for duplicated Index or DatetimeIndex is repeat numpy array created by numpy.arange and selecting by positions by iloc:

df = df.iloc[np.arange(len(df)).repeat(df['Col 2'])].reset_index(drop=True)

EDIT:

Solution without np.repeat:

df =df.loc[[c for a, b in zip(df.index, df['Col 2']) for c in [a] * b]].reset_index(drop=True)
print (df)
   Col 1  Col 2
0   Adam      3
1   Adam      3
2   Adam      3
3  Sarah      2
4  Sarah      2
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thank you Jezrael for your help but i am getting a TypeError - TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe' – M ob Jun 11 '18 at 11:46
  • Try `df.loc[df.index.repeat(df['Col 2']).astype(np.int64)].reset_index(drop=True)` – jezrael Jun 11 '18 at 11:47
  • import pandas as pd import numpy as np here is the code i am using but still getting Type Error. readFile = pd.read_csv('test.csv') df = pd.DataFrame(readFile) x = df.loc[df.index.repeat(df['Col 2']).astype(np.int64)].reset_index(drop=True) print(x) – M ob Jun 11 '18 at 12:32
  • @Mob - It looks like pandas verxion issue, what is your version of pandas? – jezrael Jun 11 '18 at 12:33
  • Hi Jezrael, Thank you for all your help - pandas: 0.23.0 – M ob Jun 11 '18 at 13:03
  • @Mob - It is bad, I try google solution and it seems numpy bug https://github.com/bytefish/facerec/issues/57 :( – jezrael Jun 11 '18 at 13:04
  • Thank you, i really appreciate your help. can you suggest a different way of repeating rows? – M ob Jun 11 '18 at 13:30
  • @Mob - Hard question, how working `df = df.iloc[np.repeat(np.arange(len(df), dtype=np.int64), df['Col 2'].values.astype(np.int64))].reset_index(drop=True)` ? – jezrael Jun 11 '18 at 13:47
  • 1
    @Mob - added solution without `np.repeat` – jezrael Jun 11 '18 at 14:06
  • Thank you jezrael this one works. Thank you very much, i really appreciate your help. Thank you – M ob Jun 12 '18 at 14:24
0

another way to solve this,

pd.merge(pd.DataFrame(df['Col 1'].repeat(df['Col 2'])),df,on=['Col 1'])
Mohamed Thasin ah
  • 10,754
  • 11
  • 52
  • 111
  • While this code snippet may solve the question, [including an explanation](http://meta.stackexchange.com/questions/114762/explaining-entirely-code-based-answers) really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. – 31piy Jun 11 '18 at 11:38