0

I'm doing this for homework.

My goal is to have an entirely new column with just the days elapsed. There are 500,000+ rows of this...so my goal is to:

  1. In the Pandas dataframe, I have these two date columns which are in different formats. I'd like to subtract these two columns, and then create a new 'Days Elapsed' column which is a simple integer list.
  2. Output into new CSV (this code is done)
  3. Now I can completely avoid parsing dates every time I rework the code/read the CSV because it's taking a crazy long time and slowing my work down.

I'm trying to convert this:

   Yearmade         Saledate
0      2004  11/16/2006 0:00
1      1996   3/26/2004 0:00
2      2001   2/26/2004 0:00

Into:

       Days Elapsed
0      1050
1      3007
2      1151

Current attempt:

year_mean = df[df['YearMade'] > 1000]['YearMade'].mean()
df.loc[df['YearMade'] == 1000, 'YearMade'] = year_mean
## There's lots of erroneous data of the year 1000, so replacing all of them with the mean of the column (mean of column without error data, that is)
df['Yearmade'] = "1/1/"+df['YearMade'].astype(str)
## This is where it errors out.
df['Yearmade'] = pd.to_datetime(df['Yearmade'])
df['Saledate'] = pd.to_datetime(df['Saledate'])
df['Age_at_Sale'] = df['Saledate'].sub(df['Yearmade'])
df = df.drop(['Saledate', 'Yearmade'], axis=1)

[then there's another class method to convert the current df into csv]
jhub1
  • 611
  • 3
  • 7
  • 19
  • I assume you know this, but CSV is very slow and inefficient (storage wise) compared to various binary formats (HDF, etc.) Pandas has lots of options: http://pandas.pydata.org/pandas-docs/stable/io.html – JohnE Dec 11 '16 at 20:49
  • Just started learning pandas/CSV wrangling this week - now learning about HDF with your link! – jhub1 Dec 11 '16 at 20:52
  • @JohnE I have to do a lot of printing to confirm if I'm cleaning dataframes correctly. Would it be advisable to convert a large CSV into HDF...first, then start wrangling it? – jhub1 Dec 11 '16 at 21:26
  • 1
    Shouldn't matter much, just don't re-save as CSV. ;-) That's what takes the most time, so try to only do it once. – JohnE Dec 11 '16 at 21:48

1 Answers1

1

assuming you have the following DF:

In [203]: df
Out[203]:
   Yearmade   Saledate
0      2004 2006-11-16
1      1996 2004-03-26
2      2001 2004-02-26
3      1000 2003-12-23     # <--- erroneous year 

Solution:

In [204]: df.loc[df.Yearmade <= 1900, 'Yearmade'] = round(df.Yearmade.loc[df.Yearmade > 1900].mean())

In [205]: df
Out[205]:
   Yearmade   Saledate
0      2004 2006-11-16
1      1996 2004-03-26
2      2001 2004-02-26
3      2000 2003-12-23    # <--- replaced with avg. year 

In [206]: df['days'] = (pd.to_datetime(Saledate) - pd.to_datetime(df.Yearmade, format='%Y')).dt.days

In [207]: df
Out[207]:
   Yearmade   Saledate  days
0      2004 2006-11-16  1050
1      1996 2004-03-26  3007
2      2001 2004-02-26  1151
3      2000 2003-12-23  1452
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
  • Yikes. It errors out because there's a lot of erroneous data like year 1000. I just tried replacing the year with means but it doesn't work still. year_mean = self.df[self.df['YearMade'] > 1000]['YearMade'].mean() # print 'Mean is {}'.format(year_mean) self.df.loc[self.df['YearMade'] == 1000, 'YearMade'] = year_mean pandas.tslib.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1000-01-01 00:00:00 – jhub1 Dec 11 '16 at 20:25
  • @jhub1, what are you going to do with erroneous years? – MaxU - stand with Ukraine Dec 11 '16 at 20:36
  • So I put more code up there but my current thinking is this: Get the mean of all years greater than 1900. Replace all years below 1900 with the mean. – jhub1 Dec 11 '16 at 20:40
  • @jhub1, yeah, go ahead! – MaxU - stand with Ukraine Dec 11 '16 at 20:53
  • I'm still debugging - I think I found the problem though. Your first line creates a 'YearMade' column that is dtype: float64. The years all look like 2004.0. Then when we try to convert 2004.0 to datetime it will error out: pd.to_datetime(self.df['YearMade'], format='%Y').dt.days – jhub1 Dec 11 '16 at 21:47
  • ok i fixed it - instead of df.Saledate, I put pd.to_datetime(Saledate). Saledate wasn't actually date parsed yet, it just looks like it was. Once you fix your answer, I'll tag yours as the answer. Thanks for your help. – jhub1 Dec 11 '16 at 22:33