I am trying to add missing months for each ID
. Added months should have info on ID
and year_month
, and NaN for Product. My code achieves this using apply()
, but is slow -- I am looking for a vectorized version, which can run significantly faster.
Specifically, df.set_index(df.index).groupby('ID').apply(add_missing_months)
takes about 20 seconds on my system with 60 000 rows. I plan to work with data with millions of rows, so I think I need to vectorize the operation. Any help is highly appreciated!
import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3], 'year_month': ['2020-01-01','2020-08-01','2020-10-01','2020-01-01','2020-07-01','2021-05-01'], 'product':['A','B','C','A','D','C']})
# Enlarge dataset to 60 000 rows
for i in range(9999):
df2 = df.iloc[-6:].copy()
df2['ID'] = df2['ID'] + 3
df = pd.concat([df,df2], axis=0, ignore_index=True)
df['year_month'] = pd.to_datetime(df['year_month'])
df.index = pd.to_datetime(df['year_month'], format = '%Y%m%d')
df = df.drop('year_month', axis = 1)
# The slow function
def add_missing_months(s):
min_d = s.index.min()
max_d = s.index.max()
s = s.reindex(pd.date_range(min_d, max_d, freq='MS'))
return(s)
df = df.set_index(df.index).groupby('ID').apply(add_missing_months)
df = df.drop('ID', axis = 1)
df = df.reset_index()