I am writing a program that will load in a timeseries excel file into a dataframe, then I create several new columns using some basic calculations. My program is going to sometimes read in excel files that are missing months for some records. So in example below I have monthly sales data for two different stores. The stores open during different months, so their first month-end date will differ. But both should have month end data up until 9/30/2020. In my file, Store BBB has no records for 8/31/2020 and 9/30/2020 because there were no Sales during those months.
Store | Month Opened | State | City | Month End Date | Sales |
---|---|---|---|---|---|
AAA | 5/31/2020 | NY | New York | 5/31/2020 | 1000 |
AAA | 5/31/2020 | NY | New York | 6/30/2020 | 5000 |
AAA | 5/31/2020 | NY | New York | 7/30/2020 | 3000 |
AAA | 5/31/2020 | NY | New York | 8/31/2020 | 4000 |
AAA | 5/31/2020 | NY | New York | 9/30/2020 | 2000 |
BBB | 6/30/2020 | CT | Hartford | 6/30/2020 | 100 |
BBB | 6/30/2020 | CT | Hartford | 7/30/2020 | 200 |
So for any instances like this, I want to be able to add two rows for Store BBB for 8/31 and 9/30. The new rows should use the same Month Opened, State and City from the most recent month-end date. The Sales should be set to 0 for both new rows. As of right now, I do the following steps:
- Create Dataframe "MaxDateData" with the store name and the max Month End Date for each Store and also the max Month End Date for the entire time series data frame, I name this field "Most Recent Date".
Store | Max Month End Date | Most Recent Date |
---|---|---|
AAA | 9/30/2020 | 9/30/2020 |
BBB | 7/30/2020 | 9/30/2020 |
- Create Dataframe "MostRecent" with the most recent row from main time series dataframe. To do this, I do an inner join between the time series dataframe and the MaxDateData on the Store Name and Max Month End Date.
Store | Month Opened | State | City | Month End Date | Sales | Max Month End Date | Most Recent Date |
---|---|---|---|---|---|---|---|
AAA | 5/31/2020 | NY | New York | 9/30/2020 | 2000 | 9/30/2020 | 9/30/2020 |
BBB | 6/30/2020 | CT | Hartford | 7/30/2020 | 200 | 7/30/2020 | 9/30/2020 |
- Create a dataframe "RequireBackfill_MostRecent" using a where clause to filter for stores where Max Month End Date < Most Recent Date. See code below. So in this example, the RequireBackfill_MostRecent table will only have a row for store BBB.
RequireBackfill_Stores_MostRecent = MaxDateData.where(MaxDateData['Max Month End Date'] <MaxDateData['Most Recent Date'])
RequireBackfill_MostRecent = MostRecent.merge(RequireBackfill_Stores_MostRecent,how='inner')
- I then use two nested for loops to cycle through dates that I need to fill in. It leverages the RequireBackfill_MostRecent dataframe which would only contain Store BBB.
X=[]
end = MaxDateData['Most Recent Date'][0]
for i in MonthlyData['Month End Date'].unique():
per1 = pd.date_range(start = i, end = end, freq ='M')
for val in per1:
Data=[]
Data = RequireBackfill_MostRecent[["Store"
,"Month Opened"
,"City"
,"State"
]].where(RequireBackfill_MostRecent['Max Month End date']==i).dropna()
Data["Month End Date"]= val
Data["Sales"]= 0
X.append(Data)
NewData = pd.concat(X)
- I then add in the NewData to my timeseries dataframe using concat
FullData_List = [MonthlyData,NewData]
FullData=pd.concat(FullData_List)
This whole process works but is there a much more efficient way to do this? This could become costly when I start to work with larger data.