4

I'm trying to convert dates in a pandas dataframe in the format 01/01/2017 to a python format in order to extract the day of the week. I've been using:

df['Date'] = pd.to_datetime(df['Date'])
df['DOW'] = df['Date'].dt.weekday_name

but the first line takes about 250 µs and I have >1 million dates to convert.

Is there a faster way to do this?

glennsl
  • 28,186
  • 12
  • 57
  • 75
kozowh
  • 93
  • 2
  • 10
  • Do you have a fixed format of the datetime? The docs say that [infer_datetime_format](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html) can give a 5-10 time speedup by setting it as `True`. You don't supply any format string that I can see so far. – roganjosh Mar 13 '17 at 20:08
  • All the dates are in the format 01/01/2017 and I don't care what format they're converted to as long as I can extract the day of the week. I hope this answers your question. – kozowh Mar 13 '17 at 20:10
  • I imagine supplying a format string would be even faster than `infer_datetime_format`. – roganjosh Mar 13 '17 at 20:13
  • WOW. Using infer_datetime_format speeded things up 45-fold. Thanks! – kozowh Mar 13 '17 at 20:18
  • Welcome. As I said in last comment, I would imagine you can get even faster if you explicitly give the format so that it doesn't need to try infer it. Datetimes are ugly; I _love_ the `dateparser` (non-pandas) module for its flexibility in prototyping but the overhead is just crazy. – roganjosh Mar 13 '17 at 20:20

1 Answers1

5

Adding the format argument for pd.to_datetime since you know what the format is will make it ~40x faster.

pd.to_datetime(df['Date'], format='%d/%m/%Y')

takes about 4 seconds for me

Also, it's a bit hacky, but if you know the upper and lower bounds for possible dates, you could join the data with a mapping of all possible strings:

dates = pd.date_range(pd.Timestamp.min, pd.Timestamp.max)
date_mappings = pd.Series(dates, index=dates.strftime('%d/%m/%Y'), name='Date')
df = df.join(date_mappings, on='Date', lsuffix='_str')

This took < 1 second on my computer and could probably be reduced further if you know the bounds of possible dates

some_programmer
  • 3,268
  • 4
  • 24
  • 59
Colin
  • 2,087
  • 14
  • 16
  • "Hacky" is one way to put it; I can't understand what's going on there at all :) Is there a clear reason why `date_range` should be faster than providing a fixed format for the dates? – roganjosh Mar 13 '17 at 20:37
  • This has actually really confused me, is `pandas` `to_datetime` relying on Python? I'll have to look into this further, thanks for pointing it out! – roganjosh Mar 13 '17 at 20:41
  • 1
    The hacky version is making a mapping of date strings to dates for all dates regardless of the input data. So, instead of converting each date string to date for every row in the data, it can merge with the mappings. `pd.date_range` is just giving a sequence of valid dates for which to make a mapping. – Colin Mar 13 '17 at 20:49
  • Aha so this works because there's far more data points with a datetime field than there are unique `datetimes` in the range of possible datetimes? In that case I consider it cunning, not hacky :) – roganjosh Mar 13 '17 at 20:51
  • 1
    Exactly. Alternatively, instead of making a mapping for all dates, you could make the mapping just for the unique date strings in the data: `date_mappings = pd.to_datetime(df.set_index('Date', drop=False).Date.drop_duplicates(), format='%d/%m/%Y')` – Colin Mar 13 '17 at 21:04
  • I like it a lot; unfortunately you already have my upvote so I cannot add again. As I said, I don't think it's hacky once explained, but I wonder if you're aware of any pitfalls that made you say that in the first place? – roganjosh Mar 13 '17 at 21:08
  • 1
    I guess it feels hacky to me because it's not using the "one obvious way" of using `pd.to_datetime` and is doing a roundabout way of getting the result. A possible pitfall would be if there are incorrectly formatted strings, it will return NaT whereas `pd.to_datetime` will raise an error (in versions >= 0.17, in older versions it'll silently keep everything as strings). – Colin Mar 13 '17 at 21:22