3

I have a large pandas DataFrame (around 1050000 entries). One of the columns is of type datetime. I want to extract year, month and weekday. The problem is that the code shown below is extremely slow:

df['Year'] = pd.DatetimeIndex(df.Date).year
df['Month'] = pd.DatetimeIndex(df.Date).month
df['Weekday'] = pd.DatetimeIndex(df.Date).weekday

Update:

The data looks like this:

      Id  DayOfWeek       Date
0      1          5 2015-07-31   
1      2          4 2015-07-30   
2      3          3 2015-07-29   
3      4          2 2015-07-28  
4      5          1 2015-07-27 

If I do this way:

df = pd.read_csv("data.csv", parse_dates=[2])

df['Year'] = pd.to_datetime(df['Date']).year
df['Month'] = pd.to_datetime(df['Date']).month
df['Weekday'] = pd.to_datetime(df['Date']).weekday

then the error is:

AttributeError: 'Series' object has no attribute 'year'
Alex Riley
  • 169,130
  • 45
  • 262
  • 238
Klausos Klausos
  • 15,308
  • 51
  • 135
  • 217

2 Answers2

3

You state that your column is already of the datetime64 type. In that case you can simply use the .dt accessor to expose the methods and attributes associated with the datetime values in the column:

df['Year'] = df.Date.dt.year

This will be much quicker than writing pd.DatetimeIndex(df.Date).year which creates a whole new index object first.

Alex Riley
  • 169,130
  • 45
  • 262
  • 238
  • It says AttributeError: Can only use .dt accessor with datetimelike values – Klausos Klausos Oct 02 '15 at 19:46
  • Ah - then you don't have a column with the datetime64 dtype after all. You need to convert the column; you can use `df['Date'] = pd.to_datetime(df['Date'])` as suggested in qmorgan's answer. – Alex Riley Oct 02 '15 at 19:50
  • Then it says ValueError: New date column already in dict Date... I am reading csv this way: df = pd.read_csv("data.csv", dtype=dtypestrain), where dtypestrain = { 'Id': int, 'DayOfWeek': int, 'Date': dt.datetime} – Klausos Klausos Oct 02 '15 at 19:52
  • And also AttributeError: 'module' object has no attribute 'parsedDate' – Klausos Klausos Oct 02 '15 at 19:55
  • It's much easier just to let `read_csv` infer the column dtypes. Try using `pd.read_csv("data.csv", parse_dates=[2])` to have the function parse the datetime values when it reads the data. – Alex Riley Oct 02 '15 at 19:59
  • Could you please see my update? I posted the code and the problem. – Klausos Klausos Oct 02 '15 at 20:07
  • @KlausosKlausos: you forgot the `.dt` accessor: use `.dt.year`. Also you may want avoid calling `pd.to_datetime(df['Date'])` each time - just to it once and then get each attribute in turn. – Alex Riley Oct 02 '15 at 20:23
2

It seems like you may be parsing the dates each time rather than all at once. Also, using the to_datetime() method may be faster.

Try

df['parsedDate'] = pd.to_datetime(df['Date'])
df['Year'] = pd.parsedDate.year
df['Month'] = pd.parsedDate.month
df['Weekday'] = pd.parsedDate.weekday
qmorgan
  • 4,794
  • 2
  • 19
  • 14
  • It says ValueError: New date column already in dict Date... I am reading csv this way: df = pd.read_csv("data.csv", dtype=dtypestrain), where dtypestrain = { 'Id': int, 'DayOfWeek': int, 'Date': dt.datetime} – Klausos Klausos Oct 02 '15 at 19:55
  • Or AttributeError: 'module' object has no attribute 'parsedDate' – Klausos Klausos Oct 02 '15 at 19:55