I've got about 7.5 million rows of data in the following format:
ndc_description effective_date ...
12-HR DECONGEST 120MG CAPLET 2015-08-19 2015-08-26 G NaN NaN 1 0.36062 36800005452 Y C/I EA
2015-07-22 2015-08-12 G NaN NaN 1 0.37681 36800005452 Y C/I EA
2015-06-17 2015-07-15 G NaN NaN 1 0.36651 36800005452 Y C/I EA
Some Other drug 2016-11-21 2015-08-26 G NaN NaN 1 0.36062 36800005452 Y C/I EA
2016-07-23 2015-08-12 G NaN NaN 1 0.37681 36800005452 Y C/I EA
2016-05-17 2015-07-15 G NaN NaN 1 0.36651 36800005452 Y C/I EA
ndc_description and effective_date form a MultiIndex.
I have an additional dataset that I'm merging with the above. The merge will be on the columns ndc_description and effective_date (the other columns are shown purely to demonstrate that other data of various types exists in the dataset).
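Roughly, the merge I'm ultimately aiming for would look something like this (just a sketch, not working code; other_data is a stand-in name for the second dataset, and the MultiIndex is reset back into columns first):

import pandas as pd

# sketch only: merge the price data with the second dataset on the two key columns
merged = pd.merge(
    Cleaned_Price_Data.reset_index(),   # lift ndc_description / effective_date back into columns
    other_data,                         # placeholder name for the second dataset
    on=['ndc_description', 'effective_date'],
    how='left',
)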
Current problem: The dates in the two datasets don't match up. In the dataset above they're (mostly) weekly, but that can't be guaranteed, and there's no guaranteed regularity in the other dataset either. Therefore, I suppose I'll need to add rows for all dates in between those listed in the effective_date column above, so I can then merge on ndc_description and effective_date. Is this the best way to go about this? Because of the amount of data involved, I'd like to optimize the code before running the full dataset through it.
Potential solutions: I've seen that .resample() may be of value here, but I haven't been able to get it to work yet. Something like this:
Cleaned_Price_Data.effective_date.resample('1D', fill_method = 'ffill', level = 1)
I think it would also improve efficiency to bound the fill by the max and min dates so that it doesn't ffill anything more than necessary. The ndc_description value should be taken into account as well, so that the dates added to effective_date are generated per drug rather than duplicated for every identical ndc_description value. Something like the sketch below is what I have in mind.
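Here's a rough, untested sketch of that idea (it assumes the index levels are named ndc_description and effective_date, that effective_date parses as datetime, and that dates are unique within each drug):

def to_daily(group):
    # expand this drug's rows to daily frequency between its own min and max date
    dates = pd.to_datetime(group.index.get_level_values('effective_date'))
    daily = group.droplevel('ndc_description')
    daily.index = dates
    return daily.sort_index().asfreq('D').ffill()

expanded = (Cleaned_Price_Data
            .groupby(level='ndc_description', group_keys=True)
            .apply(to_daily))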
EDIT: Here's some code illustrating the current state of my dataframe and how it should look after transformations are complete. I'm trying to transform a dataframe such as this:
idx = pd.MultiIndex.from_product([['drug_a', 'drug_b', 'drug_c'],
                                  ['2015-08-19', '2015-08-17', '2015-08-14']],
                                 names=['drug_name', 'effective_date'])
col = ['other_data_1', 'other_data_2', 'other_data_3']
pre_transform = pd.DataFrame('-', index=idx, columns=col)
pre_transform
To one like this (note dates are added):
idx = pd.MultiIndex.from_product([['drug_a', 'drug_b', 'drug_c'],
                                  ['2015-08-19', '2015-08-18', '2015-08-17',
                                   '2015-08-16', '2015-08-15', '2015-08-14']],
                                 names=['drug_name', 'effective_date'])
col = ['other_data_1', 'other_data_2', 'other_data_3']
post_change = pd.DataFrame('-', index=idx, columns=col)
post_change
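For what it's worth, here's an untested sketch of that transform on the toy frame above: each drug's dates are re-indexed onto its own full daily range and the existing rows are carried forward.

def fill_dates(group):
    # build the complete daily range between this drug's first and last date
    dates = pd.to_datetime(group.index.get_level_values('effective_date'))
    full_range = pd.date_range(dates.min(), dates.max(), freq='D')
    out = group.droplevel('drug_name')
    out.index = dates
    return out.reindex(full_range).ffill()

filled = (pre_transform
          .groupby(level='drug_name', group_keys=True)
          .apply(fill_dates))

The result should contain the same drug/date combinations as post_change, just with dates sorted ascending rather than descending.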
EDIT 2: I've come up with the following code (via Parfait's answer here), which seems to do the trick:
def expand_dates(ser):
    # one row per calendar day between this NDC's first and last effective_date
    return pd.DataFrame({'effective_date': pd.date_range(ser['effective_date'].min(),
                                                         ser['effective_date'].max(),
                                                         freq='D')})

price_cols = list(Cleaned_Price_Data.columns)

# expand each NDC to daily rows, merge the original data back on, and forward-fill
all_effective_dates = (Cleaned_Price_Data.groupby(['ndc'])
                       .apply(expand_dates)
                       .reset_index()
                       .merge(Cleaned_Price_Data, how='left')[price_cols]
                       .ffill())
However, at 55 million rows the result is quite bloated, and I still need to merge it with the other dataset. Any attempts at optimizing this (or suggestions for a more efficient alternative) would be greatly appreciated.
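One untested idea I'm considering to keep memory down before that merge is to shrink the expanded frame's dtypes first (the column names below are placeholders for my actual ones):

# untested: store the drug identifier as a categorical and downcast numeric columns
# 'ndc' and 'price' are placeholder column names
all_effective_dates['ndc'] = all_effective_dates['ndc'].astype('category')
all_effective_dates['price'] = pd.to_numeric(all_effective_dates['price'], downcast='float')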