0

Working with Pandas, I have a performance issue on one step. It works on a small amount of data, but I can't get an answer in reasonable time on a larger amount.

I have a dataframe like this:

ColA ColB ColC start      end
1    1    1    2020-01-01 2021-01-01

with 715K rows like this one and the 5 columns you see; dates are different for each row.

I want to change the granularity of the dataframe so that there are as many rows as there are dates in the intervals.

Something like this:

ColA ColB ColC Date
1    1    1    2020-01-01
1    1    1    2020-01-02
[...]
1    1    1    2020-12-31
1    1    1    2021-01-01

As I estimate the intervals contain about 100 dates on average, I should end up with something like 71.5M rows.

I tried this:

import pandas as p
from datetime import timedelta

df2 = p.DataFrame(columns=['ColA', 'ColB', 'ColC', 'DATE'])
for index, row in df1.iterrows():
    ColA = row['ColA']
    ColB = row['ColB']
    ColC = row['ColC']
    start_date = p.to_datetime(row['start'])
    end_date = p.to_datetime(row['end'])
    delta = end_date - start_date
    # one row per day in the interval
    for i in range(delta.days + 1):
        day = start_date + timedelta(days=i)
        new_row = {'ColA': ColA, 'ColB': ColB, 'ColC': ColC, 'DATE': day}
        df2 = df2.append(new_row, ignore_index=True)

but it has been running for hours without results :(

Do you know how I can do better? Thanks for any answers.

ImFabien75
  • 157
  • 1
  • 9

1 Answer

0

Try creating the whole column first, then explode:

df['Date'] = df.apply(lambda x: pd.date_range(x['start'], x['end'], freq='D').tolist(), axis=1)

df.drop(columns=['start','end']).explode('Date')

Output:

    ColA  ColB  ColC       Date
0      1     1     1 2020-01-01
0      1     1     1 2020-01-02
0      1     1     1 2020-01-03
0      1     1     1 2020-01-04
0      1     1     1 2020-01-05
..   ...   ...   ...        ...
0      1     1     1 2020-12-28
0      1     1     1 2020-12-29
0      1     1     1 2020-12-30
0      1     1     1 2020-12-31
0      1     1     1 2021-01-01

Note: your data is relatively big, both because of the original size and the date ranges. This would be a bit faster than row-by-row appends, but the first op (creating Date) can still be slow, since apply calls Python code once per row.
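If the apply step is still too slow, a fully vectorized sketch (assuming the same ColA/ColB/ColC/start/end layout, with start and end already parsed as datetimes) is to repeat each row by its number of days and add a per-row day offset, avoiding any Python-level loop:

```python
import pandas as pd

# small stand-in for the 715K-row frame
df = pd.DataFrame({
    'ColA': [1],
    'ColB': [1],
    'ColC': [1],
    'start': pd.to_datetime(['2020-01-01']),
    'end': pd.to_datetime(['2021-01-01']),
})

# number of dates in each interval (inclusive of both ends)
n = (df['end'] - df['start']).dt.days + 1

# repeat every row once per date in its interval
out = df.loc[df.index.repeat(n)].copy()

# within each original row, count 0, 1, 2, ... and add it as days
offsets = out.groupby(level=0).cumcount()
out['Date'] = out['start'] + pd.to_timedelta(offsets, unit='D')

out = out.drop(columns=['start', 'end']).reset_index(drop=True)
```

Here `out` has one row per day (367 for this leap-year interval). Everything runs as column-wise operations, which tends to scale much better to tens of millions of rows than per-row apply or append.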

Quang Hoang
  • 146,074
  • 10
  • 56
  • 74