4

I am writing a program that will load in a timeseries excel file into a dataframe, then I create several new columns using some basic calculations. My program is going to sometimes read in excel files that are missing months for some records. So in example below I have monthly sales data for two different stores. The stores open during different months, so their first month-end date will differ. But both should have month end data up until 9/30/2020. In my file, Store BBB has no records for 8/31/2020 and 9/30/2020 because there were no Sales during those months.

Store Month Opened State City Month End Date Sales
AAA 5/31/2020 NY New York 5/31/2020 1000
AAA 5/31/2020 NY New York 6/30/2020 5000
AAA 5/31/2020 NY New York 7/30/2020 3000
AAA 5/31/2020 NY New York 8/31/2020 4000
AAA 5/31/2020 NY New York 9/30/2020 2000
BBB 6/30/2020 CT Hartford 6/30/2020 100
BBB 6/30/2020 CT Hartford 7/30/2020 200

So for any instances like this, I want to be able to add two rows for Store BBB for 8/31 and 9/30. The new rows should use the same Month Opened, State and City from the most recent month-end date. The Sales should be set to 0 for both new rows. As of right now, I do the following steps:

  1. Create Dataframe "MaxDateData" with the store name and the max Month End Date for each Store and also the max Month End Date for the entire time series data frame, I name this field "Most Recent Date".
Store Max Month End Date Most Recent Date
AAA 9/30/2020 9/30/2020
BBB 7/30/2020 9/30/2020
  1. Create Dataframe "MostRecent" with the most recent row from main time series dataframe. To do this, I do an inner join between the time series dataframe and the MaxDateData on the Store Name and Max Month End Date.
Store Month Opened State City Month End Date Sales Max Month End Date Most Recent Date
AAA 5/31/2020 NY New York 9/30/2020 2000 9/30/2020 9/30/2020
BBB 6/30/2020 CT Hartford 7/30/2020 200 7/30/2020 9/30/2020
  1. Create a dataframe "RequireBackfill_MostRecent" using a where clause to filter for stores where Max Month End Date < Most Recent Date. See code below. So in this example, the RequireBackfill_MostRecent table will only have a row for store BBB.
RequireBackfill_Stores_MostRecent = MaxDateData.where(MaxDateData['Max Month End Date'] <MaxDateData['Most Recent Date'])
RequireBackfill_MostRecent = MostRecent.merge(RequireBackfill_Stores_MostRecent,how='inner')
  1. I then use two nested for loops to cycle through dates that I need to fill in. It leverages the RequireBackfill_MostRecent dataframe which would only contain Store BBB.
X=[]
end = MaxDateData['Most Recent Date'][0]
for i in MonthlyData['Month End Date'].unique():
    per1 = pd.date_range(start = i,  end = end, freq ='M') 
    for val in per1: 
        Data=[]
        Data = RequireBackfill_MostRecent[["Store"
                                           ,"Month Opened"
                                           ,"City"
                                           ,"State"
                                           ]].where(RequireBackfill_MostRecent['Max Month End date']==i).dropna()   

        Data["Month End Date"]= val                
        Data["Sales"]= 0
        X.append(Data)
NewData = pd.concat(X) 
  1. I then add in the NewData to my timeseries dataframe using concat
FullData_List = [MonthlyData,NewData]
FullData=pd.concat(FullData_List)

This whole process works but is there a much more efficient way to do this? This could become costly when I start to work with larger data.

Safak Ozdek
  • 906
  • 11
  • 18

2 Answers2

0
  1. just try upsample of the DateTime index. ref: pandas-resample-upsample-last-date-edge-of-data
# group by `Store`
# with `Month End Date` column show be converted to DateTime

group.set_index(['Month End Date']).resample('M').asfreq()
  1. be notice that: 7/30/2020 is not the end day of July. 7/31/2020 is. so Using this method 7/30/2020 will be a problem(convert the Month End Date as the truely end date).
Ferris
  • 5,325
  • 1
  • 14
  • 23
0

Here's the step by step approach to do this. If you have questions, let me know.

import pandas as pd
pd.set_option('display.max_columns', None)
c = ['Store','Month Opened','State','City','Month End Date','Sales']
d = [['AAA','5/31/2020','NY','New York','5/31/2020',1000],
['AAA','5/31/2020','NY','New York','6/30/2020',5000],
['AAA','5/31/2020','NY','New York','7/30/2020',3000],
['AAA','5/31/2020','NY','New York','8/31/2020',4000],
['AAA','5/31/2020','NY','New York','9/30/2020',2000],
['BBB','6/30/2020','CT','Hartford','6/30/2020',100],
['BBB','6/30/2020','CT','Hartford','7/30/2020',200],
['CCC','3/31/2020','NJ','Cranbury','3/31/2020',1500]]

df = pd.DataFrame(d,columns = c)
df['Month Opened'] = pd.to_datetime(df['Month Opened'])
df['Month End Date'] = pd.to_datetime(df['Month End Date'])

#select last entry for each Store
df1 = df.sort_values('Month End Date').drop_duplicates('Store', keep='last').copy()

#delete all rows that have 2020-09-30. We want only ones that are less than 2020-09-30
df1 = df1[df1['Month End Date'] != '2020-09-30']

#set target end date to 2020-09-30
df1['Target_End_Date'] = pd.to_datetime ('2020-09-30')

#calculate how many rows to repeat
df1['repeats'] = df1['Target_End_Date'].dt.to_period('M').astype(int) - df1['Month End Date'].dt.to_period('M').astype(int)

#add 1 month to month end so we can start repeating from here
df1['Month End Date'] = df1['Month End Date'] + pd.DateOffset(months =1)

#set sales value as 0 per requirement
df1['Sales'] = 0

#repeat each row by the value in column repeats
df1 = df1.loc[df1.index.repeat(df1.repeats)].reset_index(drop=True)

#reset repeats to start from 0 thru n using groupby cumcouunt
#this will be used to calculate months to increment from month end date
df1['repeats'] = df1.groupby('Store').cumcount()

#update month end date based on value in repeats
df1['Month End Date'] = df1.apply(lambda x: x['Month End Date'] + pd.DateOffset(months = x['repeats']), axis=1)

#set end date to last day of the month
df1['Month End Date'] = pd.to_datetime(df1['Month End Date']) + pd.offsets.MonthEnd(0)

#drop columns that we don't need anymore. required before we concat dfs
df1.drop(columns=['Target_End_Date','repeats'],inplace=True)

#concat df and df1 to get the final dataframe
df = pd.concat([df, df1], ignore_index=True)

#sort values by Store and Month End Date
df = df.sort_values(by=['Store','Month End Date'],ignore_index=True)

print (df)

The output of this is:

   Store Month Opened State      City Month End Date  Sales
0    AAA   2020-05-31    NY  New York     2020-05-31   1000
1    AAA   2020-05-31    NY  New York     2020-06-30   5000
2    AAA   2020-05-31    NY  New York     2020-07-30   3000
3    AAA   2020-05-31    NY  New York     2020-08-31   4000
4    AAA   2020-05-31    NY  New York     2020-09-30   2000
5    BBB   2020-06-30    CT  Hartford     2020-06-30    100
6    BBB   2020-06-30    CT  Hartford     2020-07-30    200
7    BBB   2020-06-30    CT  Hartford     2020-08-30      0
8    BBB   2020-06-30    CT  Hartford     2020-09-30      0
9    CCC   2020-03-31    NJ  Cranbury     2020-03-31   1500
10   CCC   2020-03-31    NJ  Cranbury     2020-04-30      0
11   CCC   2020-03-31    NJ  Cranbury     2020-05-31      0
12   CCC   2020-03-31    NJ  Cranbury     2020-06-30      0
13   CCC   2020-03-31    NJ  Cranbury     2020-07-31      0
14   CCC   2020-03-31    NJ  Cranbury     2020-08-31      0
15   CCC   2020-03-31    NJ  Cranbury     2020-09-30      0

Note I added one more entry with CCC to show you more variation.

Joe Ferndz
  • 8,417
  • 2
  • 13
  • 33