I'm doing this for homework.
My goal is to have an entirely new column with just the days elapsed. There are 500,000+ rows of this...so my goal is to:
- In the Pandas dataframe, I have these two date columns which are in different formats. I'd like to subtract these two columns, and then create a new 'Days Elapsed' column which is a simple integer list.
- Output into new CSV (this code is done)
- Now I can completely avoid parsing dates every time I rework the code/read the CSV because it's taking a crazy long time and slowing my work down.
I'm trying to convert this:
Yearmade Saledate
0 2004 11/16/2006 0:00
1 1996 3/26/2004 0:00
2 2001 2/26/2004 0:00
Into:
Days Elapsed
0 1050
1 3007
2 1151
Current attempt:
year_mean = df[df['YearMade'] > 1000]['YearMade'].mean()
df.loc[df['YearMade'] == 1000, 'YearMade'] = year_mean
## There's lots of erroneous data of the year 1000, so replacing all of them with the mean of the column (mean of column without error data, that is)
df['Yearmade'] = "1/1/"+df['YearMade'].astype(str)
## This is where it errors out.
df['Yearmade'] = pd.to_datetime(df['Yearmade'])
df['Saledate'] = pd.to_datetime(df['Saledate'])
df['Age_at_Sale'] = df['Saledate'].sub(df['Yearmade'])
df = df.drop(['Saledate', 'Yearmade'], axis=1)
[then there's another class method to convert the current df into csv]