
I've got about 7.5 million rows of data in the following format:

ndc_description               effective_date        ...                             
12-HR DECONGEST 120MG CAPLET  2015-08-19            2015-08-26          G   NaN     NaN     1   0.36062     36800005452     Y   C/I     EA
                              2015-07-22            2015-08-12          G   NaN     NaN     1   0.37681     36800005452     Y   C/I     EA
                              2015-06-17            2015-07-15          G   NaN     NaN     1   0.36651     36800005452     Y   C/I     EA
Some Other drug               2016-11-21            2015-08-26          G   NaN     NaN     1   0.36062     36800005452     Y   C/I     EA
                              2016-07-23            2015-08-12          G   NaN     NaN     1   0.37681     36800005452     Y   C/I     EA
                              2016-05-17            2015-07-15          G   NaN     NaN     1   0.36651     36800005452     Y   C/I     EA

ndc_description and effective_date form a MultiIndex.

I have an additional dataset that I'm merging with the above. They'll be merged on the columns ndc_description and effective_date (the other columns shown are purely to demonstrate that other data of various types exists in the dataset).

Current problem: The dates in the two datasets don't match up. In the dataset above they're (mostly) on a weekly basis, but that can't be guaranteed, and there's no guaranteed regularity in the other dataset either. I therefore suppose I'll need to add rows for all the dates in between those listed in the effective_date column above, so that I can then merge on ndc_description and effective_date. Is this the best way to go about this process? Because of the amount of data involved, I'd like to optimize the code before running the full dataset through it.

Potential solutions: I've seen that .resample() may be of value here, but I've not yet been able to get it to work. My attempt was something like this (which doesn't run as written: fill_method is deprecated in current pandas in favor of chaining .ffill(), and effective_date is an index level rather than a column): Cleaned_Price_Data.effective_date.resample('1D', fill_method='ffill', level=1)

I think it would also improve efficiency to bound the fill by the max & min dates per drug, so that nothing is forward-filled further than necessary. Likewise, the expansion should take the ndc_description value into account, so that the dates added to the effective_date column fall within each drug's own range rather than being duplicated across every identical ndc_description value; a sketch of what I mean follows.
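
To make that concrete, here's a minimal sketch of what I'm picturing, assuming Cleaned_Price_Data is indexed by (ndc_description, effective_date) and the date level is already datetime64. Grouping before resampling keeps the fill inside each drug, and it automatically bounds the fill at each drug's own min/max dates:

import pandas as pd

daily = (Cleaned_Price_Data
         .reset_index('ndc_description')  # leave a plain DatetimeIndex behind
         .sort_index()
         .groupby('ndc_description')
         .resample('D')                   # each drug spans only its own min..max date
         .ffill()
         # depending on the pandas version, the grouping key may be echoed
         # back as a column in addition to an index level
         .drop(columns='ndc_description', errors='ignore'))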

EDIT: Here's some code illustrating the current state of my dataframe and how it should look after transformations are complete. I'm trying to transform a dataframe such as this:

import pandas as pd

idx = pd.MultiIndex.from_product([['drug_a', 'drug_b', 'drug_c'],
                                  ['2015-08-19', '2015-08-17', '2015-08-14']],
                                 names=['drug_name', 'effective_date'])
col = ['other_data_1', 'other_data_2', 'other_data_3']

pre_transform = pd.DataFrame('-', idx, col)
pre_transform

To one like this (note dates are added):

idx = pd.MultiIndex.from_product([['drug_a', 'drug_b', 'drug_c'],
                                  ['2015-08-19', '2015-08-18', '2015-08-17', '2015-08-16', '2015-08-15', '2015-08-14']],
                                 names=['drug_name', 'effective_date'])
col = ['other_data_1', 'other_data_2', 'other_data_3']

post_change = pd.DataFrame('-', idx, col)
post_change
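
For reference, here's a minimal sketch of the transformation itself, run against the pre_transform frame above (expand_group is just an illustrative helper name). Note that it sorts the dates ascending, since forward-filling has to run from older to newer rows:

# the toy dates are strings; parse them so date_range/reindex line up
pre_transform.index = pre_transform.index.set_levels(
    pd.to_datetime(pre_transform.index.levels[1]), level='effective_date')

def expand_group(grp):
    # one drug's slice: drop the drug level, reindex onto every calendar
    # day between that drug's own earliest and latest date, then ffill
    grp = grp.droplevel('drug_name').sort_index()
    full_range = pd.date_range(grp.index.min(), grp.index.max(), freq='D')
    return grp.reindex(full_range).ffill()

post_change = pre_transform.groupby(level='drug_name').apply(expand_group)
post_change.index.names = ['drug_name', 'effective_date']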

EDIT 2: I've come up with the following code (via Parfait's answer here) which seems to do the trick:

def expand_dates(grp):
    # grp is one ndc's sub-frame; build every day between its min and max date
    return pd.DataFrame({'effective_date': pd.date_range(grp['effective_date'].min(),
                                                         grp['effective_date'].max(), freq='D')})

price_cols = list(Cleaned_Price_Data.columns)

# expand dates per ndc, left-join the original rows back, then forward-fill
all_effective_dates = (Cleaned_Price_Data.groupby(['ndc']).apply(expand_dates)
                       .reset_index().merge(Cleaned_Price_Data, how='left')
                       [price_cols].ffill())

However, at 55 million rows, the result is quite bloated, and I'll still be attempting to merge it with another dataset. Any attempts at optimizing this (or suggestions for a more efficient alternative) would be greatly appreciated.
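
For context, one alternative I'm weighing (a sketch only, not yet tested at scale; other_dataset is a stand-in name for the second dataset): pd.merge_asof can match each row of the second dataset to the most recent prior effective_date within the same drug, which should be equivalent to the daily expansion followed by an exact merge, without ever materializing the in-between rows:

import pandas as pd

# merge_asof requires both frames sorted on the key, with datetime64 dates
prices = Cleaned_Price_Data.reset_index().sort_values('effective_date')
other = other_dataset.sort_values('effective_date')

merged = pd.merge_asof(other, prices,
                       on='effective_date',   # nearest earlier date...
                       by='ndc_description',  # ...within the same drug
                       direction='backward')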

alofgran
  • Can you create a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve)? Also, could you add a sample of the `additional dataset` and the expected output? – jezrael Oct 30 '19 at 06:19
  • @jezrael - added reproducible code illustrating the current state of my dataframe and how I'd like it to end up. The problem itself isn't verifiable, as it's more of the "how do I do this" variety than the "I get this error, what do I do" kind. – alofgran Oct 30 '19 at 17:12

0 Answers