1

Input df:

  ID    Date   Start    End
  ABC   2012   15       17
  ABC   2013   11       20
  XYZ   2012   30       50
  XYZ   2014   50       100

Expected Output outputdf:

  ID    Date   Variable Value
  ABC   2012   Start    15
  ABC   2012   End      17
  ABC   2013   Start    11
  ABC   2013   End      20
  XYZ   2012   Start    30
  XYZ   2012   End      50
  XYZ   2014   Start    50
  XYZ   2014   End      100

Code:

columns=['ID','Date','Variable','Value']
outputdf=pd.DataFrame(index=range(len(df)*2),columns=columns)
tempPos=0
for pos,id in enumerate(df['ID']):
   outputdf.loc[[tempPos,tempPos+1],'ID']=df.loc[pos,'ID']
   outputdf.loc[[tempPos,tempPos+1],'Date']=df.loc[pos,'Date']
   outputdf.loc[tempPos,'Variable']="Start"
   outputdf.loc[tempPos+1,'Variable']="End"
   outputdf.loc[tempPos,'Value']=df.loc[pos,'Start']
   outputdf.loc[tempPos+1,'Value']=df.loc[pos,'End']
   tempPos=tempPos+2

I have more than 100k rows and it takes a very long time. How do I optimize this? Any help is appreciated. Thanks!

Jain
  • 959
  • 2
  • 13
  • 31

1 Answers1

2

Use DataFrame.melt:

(df.melt(id_vars=['ID','Date'],
              var_name='Variable',
              value_name='Value').sort_values(['ID','Date']))

or

(df.set_index(['ID','Date'])
   .rename_axis(columns='Variable')
   .stack()
   .reset_index(name='Value'))

Output

    ID  Date Variable  Value
0  ABC  2012    Start     15
4  ABC  2012      End     17
1  ABC  2013    Start     11
5  ABC  2013      End     20
2  XYZ  2012    Start     30
6  XYZ  2012      End     50
3  XYZ  2014    Start     50
7  XYZ  2014      End    100

Times

%%timeit
(df.set_index(['ID','Date'])
   .rename_axis(columns='Variable')
   .stack()
   .reset_index(name='Value'))
5.3 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
(df.melt(id_vars=['ID','Date'],
          var_name='Variable',
          value_name='Value').sort_values(['ID','Date']))
5.77 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
columns=['ID','Date','Variable','Value']
outputdf=pd.DataFrame(index=range(len(df)*2),columns=columns)
tempPos=0
for pos,id in enumerate(df['ID']):
   outputdf.loc[[tempPos,tempPos+1],'ID']=df.loc[pos,'ID']
   outputdf.loc[[tempPos,tempPos+1],'Date']=df.loc[pos,'Date']
   outputdf.loc[tempPos,'Variable']="Start"
   outputdf.loc[tempPos+1,'Variable']="End"
   outputdf.loc[tempPos,'Value']=df.loc[pos,'Start']
   outputdf.loc[tempPos+1,'Value']=df.loc[pos,'End']
   tempPos=tempPos+2
9.72 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
ansev
  • 30,322
  • 5
  • 17
  • 31