0

I am attempting to expand the records in a data frame between two dates. Given the input file of single entry for each record, I want to expand it based on a given date.

Here is an example of the input: Input Data

Here is an example of the desired expanded output: Output data

Based on some other examples and documentation online, what I attempted to do was expand out the data frame on a 6 month time frame to get two records for each year, then I corrected the dates based on the birthday of the records using a counter to determine the split for before and after birthday.

df_expand['DATE'] = [pd.date_range(s, e, freq='6M') for s, e in
              zip(pd.to_datetime(df_expand['Exposure Start']),
                  pd.to_datetime(df_expand['Exposure Stop']))]

df_expand = df_expand.explode('DATE').drop(['Exposure Start', 'Exposure Stop'], axis=1)

df_merged['counter'] = range(len(df_merged))
df_merged['start end'] = np.where(df_merged['counter'] % 2 != 0, 1, 0)
    
df_merged['DoB Year'] = df_merged['DoB'].dt.year
df_merged['DoB Month'] = df_merged['DoB'].dt.month
df_merged['DoB Day'] = df_merged['DoB'].dt.day

df_merged.loc[df_merged['start end'] == 0, 'Exposure Start'] = '1/1/'+ df_merged['Calendar Year'].astype(str)
df_merged.loc[df_merged['start end'] == 1, 'Exposure Start'] = df_merged['DoB Month'].astype(str) + '/' + (df_merged['DoB Day'].astype(int)+1).astype(str) + '/' + df_merged['Calendar Year'].astype(str)


df_merged.loc[df_merged['start end'] == 0, 'Exposure Stop'] = df_merged['DoB Month'].astype(str) + '/' + df_merged['DoB Day'].astype(str) + '/' + df_merged['Calendar Year'].astype(str)
df_merged.loc[df_merged['start end'] == 1, 'Exposure Stop'] = '12/31/'+ df_merged['Calendar Year'].astype(str)

This solution is clearly not elegant, and while it worked originally for my proof of concept, it is now running into issues with edge cases involving rules for the Exposure Start.

Study years are split into 2 separate periods, around the record's birthday.

The initial exposure begins 1/1 of the study year (or, the date that the record enters the study, whichever comes later) and goes through the day before the birthday (or non-death exit date, if that comes sooner).

The 2nd period goes from the birthday to the end of the calendar year (or non-death exit date, if that comes sooner). Where a death is observed, exposure is continued through the next birthday.

An iterative solution is probably better suited, but this was the documentation and guidance I received.

jn1351
  • 53
  • 7

1 Answers1

0
df_merged = pd.read_excel("inputdatawithtestcase.xlsx")

df_merged['DATE'] = [pd.date_range(s, e, freq='6M') for s, e in
              zip(pd.to_datetime(df_merged['Exposure Start']),
                  pd.to_datetime(df_merged['Exposure Stop']))]

df_merged = df_merged.explode('DATE')

df_merged['counter'] = range(len(df_merged))
df_merged['start end'] = np.where(df_merged['counter'] % 2 != 0, 1, 0)
    
df_merged['DoB Year'] = df_merged['DoB'].dt.year
df_merged['DoB Month'] = df_merged['DoB'].dt.month
df_merged['DoB Day'] = df_merged['DoB'].dt.day

df_merged = df_merged.reset_index()

df_merged["Exposure Start month"] = df_merged["Exposure Start"].dt.month
df_merged["Exposure Start day"] = df_merged["Exposure Start"].dt.day

df_merged["new_perfect_year"] = df_merged["DATE"].dt.year

df_merged["start end"].loc[3]

Last_column = []
second_last_column = []

for a in range(len(df_merged)):
    if a>=1:
        if df_merged["DoB Year"].loc[a] != match_date:
            count = 0
    if (df_merged["Exposure Start day"].loc[a] == 1) & (df_merged["Exposure Start month"].loc[a] == 1):
        if df_merged["Exposure Start day"].loc[a] == 1:
            if df_merged["start end"].loc[a]== 0:
                date = str(df_merged['Record ID'].loc[a]) + '/1/'+ str(df_merged['new_perfect_year'].loc[a])
                Last_column.append(date)
            else:
                date = str(df_merged['Record ID'].loc[a]) + '/16/'+ str(df_merged['new_perfect_year'].loc[a])
                Last_column.append(date)
        else:
            if df_merged["start end"].loc[a]== 0:
                date = str(df_merged['Exposure Start day'].loc[a]) + '/1/'+ str(df_merged['new_perfect_year'].loc[a])
                Last_column.append(date)
            else:
                date = str(df_merged['Record ID'].loc[a]) + '/16/'+ str(df_merged['new_perfect_year'].loc[a])
                Last_column.append(date)
    elif count == 0:
        date = str(df_merged['Exposure Start month'].loc[a]) + "/" +str(df_merged['Exposure Start day'].loc[a]) + "/" + str(df_merged['new_perfect_year'].loc[a])
        Last_column.append(date)
        count = count + 1
    else:
        if df_merged["Exposure Start day"].loc[a] == 1:
            if df_merged["start end"].loc[a]== 0:
                date = str(df_merged['Record ID'].loc[a]) + '/16/'+ str(df_merged['new_perfect_year'].loc[a])
                Last_column.append(date)
            else:
                date = str(df_merged['Record ID'].loc[a]) + '/1/'+ str(df_merged['new_perfect_year'].loc[a])
                Last_column.append(date)
        else:
            if df_merged["start end"].loc[a]== 0:
                date = str(df_merged['Record ID'].loc[a]) + '/16/'+ str(df_merged['new_perfect_year'].loc[a])
                Last_column.append(date)
            else:
                date = str(df_merged['Record ID'].loc[a]) + '/1/'+ str(df_merged['new_perfect_year'].loc[a])
                Last_column.append(date)
    match_date = df_merged["DoB Year"].loc[a] 
        

for a in range(len(df_merged)):
    if (df_merged["Exposure Start day"].loc[a] == 1) & (df_merged["Exposure Start month"].loc[a] == 1):
        if df_merged["Exposure Start day"].loc[a] == 1:
            if df_merged["start end"].loc[a]== 0:
                date = str(df_merged['Record ID'].loc[a]) + '/15/'+ str(df_merged['new_perfect_year'].loc[a])
                second_last_column.append(date)
            else:
                date = '12' + '/31/'+ str(df_merged['new_perfect_year'].loc[a])
                second_last_column.append(date)
        else:
            if df_merged["start end"].loc[a]== 0:
                date = str(df_merged['Exposure Start day'].loc[a]) + '/15/'+ str(df_merged['new_perfect_year'].loc[a])
                second_last_column.append(date)
            else:
                date = '12' + '/31/'+ str(df_merged['new_perfect_year'].loc[a])
                second_last_column.append(date)
    else:
        if df_merged["Exposure Start day"].loc[a] == 1:
            if df_merged["start end"].loc[a]== 0:
                date = '12' + '/31/'+ str(df_merged['new_perfect_year'].loc[a])
                second_last_column.append(date)
            else:
                date = str(df_merged['Record ID'].loc[a]) + '/15/'+ str(df_merged['new_perfect_year'].loc[a])
                second_last_column.append(date)
        else:
            if df_merged["start end"].loc[a]== 0:
                date = '12' + '/31/'+ str(df_merged['new_perfect_year'].loc[a])
                second_last_column.append(date)
            else:
                date = str(df_merged['Record ID'].loc[a]) + '/15/'+ str(df_merged['new_perfect_year'].loc[a])
                second_last_column.append(date)
    match_date = df_merged["DoB Year"].loc[a] 

last = pd.DataFrame(Last_column, columns = ["Last column"])

last_2 = pd.DataFrame(second_last_column, columns = ["Second Last column"])

final_df = pd.concat([df_merged, last], axis = 1)

final_df = pd.concat([final_df, last_2], axis = 1)

final_df

final_df = final_df[["Record ID", "DoB", "Exposure Start", "Last column", "Second Last column"]]

final_df.to_csv("name_final_this_first.csv")

Input Data

Output Data

jn1351
  • 53
  • 7