23

I read a csv file containing 150,000 lines into a pandas dataframe. This dataframe has a field, Date, with the dates in yyyy-mm-dd format. I want to extract the month, day and year from it and copy into the dataframes' columns, Month, Day and Year respectively. For a few hundred records the below two methods work ok, but for 150,000 records both take a ridiculously long time to execute. Is there a faster way to do this for 100,000+ records?

First method:

df = pandas.read_csv(filename)
for i in xrange(len(df)): 
   df.loc[i,'Day'] = int(df.loc[i,'Date'].split('-')[2])

Second method:

df = pandas.read_csv(filename)
for i in xrange(len(df)):
   df.loc[i,'Day'] = datetime.strptime(df.loc[i,'Date'], '%Y-%m-%d').day

Thank you.

Chris Tang
  • 567
  • 7
  • 18
ram
  • 507
  • 1
  • 5
  • 11
  • 2
    What's your time limit? Or to remove machine capability from consideration, what factor do you need to speed it up by? I don't think "ridiculously long" is precise enough for someone answering this question to know whether an answer like `int(df.loc[i,'Date'][-2:])` would be fast enough, leaving aside for a moment the issue of whether the data format is reliable enough for it to be considered correct. – Steve Jessop Feb 22 '14 at 12:28
  • I haven't set a time limit. The data is clean-no NaN/NaT entries or messy/incorrectly formatted values. I was unable to think of a third way to do this. Hence I am trying to find out if there is a better logic, or maybe some built-in pandas methods that can speed things on the same machine. – ram Feb 22 '14 at 12:47
  • Have you thought about wrapping a csv reader so that it splits the date column and append day, month and year and create a pandas frame with that iterator. – Maciej Gol Feb 22 '14 at 13:05
  • @Steve: Thank you very much. This is faster than the above two. Is there a way to find out the time taken by each code to execute? – ram Feb 22 '14 at 13:13
  • 1
    @ram: use the `timeit` module. – Steve Jessop Feb 23 '14 at 11:23
  • @kroolik: i haven't used wrappers before. so i wouldn't know. is it native to python? i mean, do we need to download any package to use them? – ram Feb 24 '14 at 16:53

3 Answers3

39

In 0.15.0 you will be able to use the new .dt accessor to do this nice syntactically.

In [36]: df = DataFrame(date_range('20000101',periods=150000,freq='H'),columns=['Date'])

In [37]: df.head(5)
Out[37]: 
                 Date
0 2000-01-01 00:00:00
1 2000-01-01 01:00:00
2 2000-01-01 02:00:00
3 2000-01-01 03:00:00
4 2000-01-01 04:00:00

[5 rows x 1 columns]

In [38]: %timeit f(df)
10 loops, best of 3: 22 ms per loop

In [39]: def f(df):
    df = df.copy()
    df['Year'] = DatetimeIndex(df['Date']).year
    df['Month'] = DatetimeIndex(df['Date']).month
    df['Day'] = DatetimeIndex(df['Date']).day
    return df
   ....: 

In [40]: f(df).head()
Out[40]: 
                 Date  Year  Month  Day
0 2000-01-01 00:00:00  2000      1    1
1 2000-01-01 01:00:00  2000      1    1
2 2000-01-01 02:00:00  2000      1    1
3 2000-01-01 03:00:00  2000      1    1
4 2000-01-01 04:00:00  2000      1    1

[5 rows x 4 columns]

From 0.15.0 on (release in end of Sept 2014), the following is now possible with the new .dt accessor:

df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
Jeff
  • 125,376
  • 21
  • 220
  • 187
  • 1
    note that you should use the parse_dates kw when reading the csv – Jeff Feb 22 '14 at 19:42
  • jeff, this works when each date is a timestamp (or if 'Date' is a DatetimeIndex. but this gave me the idea to modify your code to `df['Days'] = DatetimeIndex(to_datetime(df['Date'])).day` in order achieve the same result. thanks a ton. and this actually worked in pandas 0.13.1 on my machine. – ram Feb 24 '14 at 18:34
  • 1
    i forgot to mention. this turned out to be the fastest so far. here, the difference between using 'for' over 150,000 elements of 'Date', and using 'DatetimeIndex' over the pandas Series 'Date' was apparent. i think i'll do a 'timeit' comparison and study all three methods. – ram Feb 25 '14 at 05:17
4

I use below code which works very well for me

df['Year']=[d.split('-')[0] for d in df.Date]
df['Month']=[d.split('-')[1] for d in df.Date]
df['Day']=[d.split('-')[2] for d in df.Date]

df.head(5)
Nim J
  • 993
  • 2
  • 9
  • 15
1

This is the cleanest answer I've found.

df = df.assign(**{t:getattr(df.data.dt,t) for t in nomtimes})

In [30]: df = pd.DataFrame({'data':pd.date_range(start, end)})

In [31]: df.head()
Out[31]:
        data
0 2011-01-01
1 2011-01-02
2 2011-01-03
3 2011-01-04
4 2011-01-05

nomtimes = ["year", "hour", "month", "dayofweek"] 
df = df.assign(**{t:getattr(df.data.dt,t) for t in nomtimes})

In [33]: df.head()
Out[33]:
        data  dayofweek  hour  month  year
0 2011-01-01          5     0      1  2011
1 2011-01-02          6     0      1  2011
2 2011-01-03          0     0      1  2011
3 2011-01-04          1     0      1  2011
4 2011-01-05          2     0      1  2011
kevin_theinfinityfund
  • 1,631
  • 17
  • 18