Converting dataframe column of years to month day year

Question

I'm doing this for homework.

My goal is to have an entirely new column with just the days elapsed. There are 500,000+ rows of this...so my goal is to:

In the Pandas dataframe, I have these two date columns which are in different formats. I'd like to subtract these two columns, and then create a new 'Days Elapsed' column which is a simple integer list.
Output into new CSV (this code is done)
Now I can completely avoid parsing dates every time I rework the code/read the CSV because it's taking a crazy long time and slowing my work down.

I'm trying to convert this:

   Yearmade         Saledate
0      2004  11/16/2006 0:00
1      1996   3/26/2004 0:00
2      2001   2/26/2004 0:00

Into:

       Days Elapsed
0      1050
1      3007
2      1151

Current attempt:

year_mean = df[df['YearMade'] > 1000]['YearMade'].mean()
df.loc[df['YearMade'] == 1000, 'YearMade'] = year_mean
## There's lots of erroneous data of the year 1000, so replacing all of them with the mean of the column (mean of column without error data, that is)
df['Yearmade'] = "1/1/"+df['YearMade'].astype(str)
## This is where it errors out.
df['Yearmade'] = pd.to_datetime(df['Yearmade'])
df['Saledate'] = pd.to_datetime(df['Saledate'])
df['Age_at_Sale'] = df['Saledate'].sub(df['Yearmade'])
df = df.drop(['Saledate', 'Yearmade'], axis=1)

[then there's another class method to convert the current df into csv]

I assume you know this, but CSV is very slow and inefficient (storage wise) compared to various binary formats (HDF, etc.) Pandas has lots of options: http://pandas.pydata.org/pandas-docs/stable/io.html — JohnE, Dec 11 '16 at 20:49
Just started learning pandas/CSV wrangling this week - now learning about HDF with your link! — jhub1, Dec 11 '16 at 20:52
@JohnE I have to do a lot of printing to confirm if I'm cleaning dataframes correctly. Would it be advisable to convert a large CSV into HDF...first, then start wrangling it? — jhub1, Dec 11 '16 at 21:26
Shouldn't matter much, just don't re-save as CSV. ;-) That's what takes the most time, so try to only do it once. — JohnE, Dec 11 '16 at 21:48

MaxU - stand with Ukraine · Accepted Answer · 2016-12-11T22:34:51.563

1

assuming you have the following DF:

In [203]: df
Out[203]:
   Yearmade   Saledate
0      2004 2006-11-16
1      1996 2004-03-26
2      2001 2004-02-26
3      1000 2003-12-23     # <--- erroneous year

Solution:

In [204]: df.loc[df.Yearmade <= 1900, 'Yearmade'] = round(df.Yearmade.loc[df.Yearmade > 1900].mean())

In [205]: df
Out[205]:
   Yearmade   Saledate
0      2004 2006-11-16
1      1996 2004-03-26
2      2001 2004-02-26
3      2000 2003-12-23    # <--- replaced with avg. year 

In [206]: df['days'] = (pd.to_datetime(Saledate) - pd.to_datetime(df.Yearmade, format='%Y')).dt.days

In [207]: df
Out[207]:
   Yearmade   Saledate  days
0      2004 2006-11-16  1050
1      1996 2004-03-26  3007
2      2001 2004-02-26  1151
3      2000 2003-12-23  1452

edited Dec 11 '16 at 22:34

answered Dec 11 '16 at 20:19

MaxU - stand with Ukraine

205,989
36
386
419

Yikes. It errors out because there's a lot of erroneous data like year 1000. I just tried replacing the year with means but it doesn't work still. year_mean = self.df[self.df['YearMade'] > 1000]['YearMade'].mean() # print 'Mean is {}'.format(year_mean) self.df.loc[self.df['YearMade'] == 1000, 'YearMade'] = year_mean pandas.tslib.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1000-01-01 00:00:00 – jhub1 Dec 11 '16 at 20:25
@jhub1, what are you going to do with erroneous years? – MaxU - stand with Ukraine Dec 11 '16 at 20:36
So I put more code up there but my current thinking is this: Get the mean of all years greater than 1900. Replace all years below 1900 with the mean. – jhub1 Dec 11 '16 at 20:40
@jhub1, yeah, go ahead! – MaxU - stand with Ukraine Dec 11 '16 at 20:53
I'm still debugging - I think I found the problem though. Your first line creates a 'YearMade' column that is dtype: float64. The years all look like 2004.0. Then when we try to convert 2004.0 to datetime it will error out: pd.to_datetime(self.df['YearMade'], format='%Y').dt.days – jhub1 Dec 11 '16 at 21:47
ok i fixed it - instead of df.Saledate, I put pd.to_datetime(Saledate). Saledate wasn't actually date parsed yet, it just looks like it was. Once you fix your answer, I'll tag yours as the answer. Thanks for your help. – jhub1 Dec 11 '16 at 22:33

Converting dataframe column of years to month day year

1 Answers1