1

Working on the NYC car crash dataset. The current csv has data from 2012 and I wanted to extract out the 2017 values. Thing is DATE is an object dtype and trying something like this ended up taking too long

data = data[(pd.to_datetime((data['DATE'])).dt.year == 2017)]

Here is how a sample of the data looks

Head of the dataset

Does anyone possibly know of a faster or more efficient way? Thank you.

jpp
  • 159,742
  • 34
  • 281
  • 339

1 Answers1

2

The bottleneck is almost certainly in datetime conversion.

To improve performance, you can utilise the infer_datetime_format argument of pd.to_datetime. As per the docs:

infer_datetime_format : boolean, default False

If True and no format is given, attempt to infer the format of the datetime strings, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by ~5-10x.

For example:

data['DATE'] = pd.to_datetime(data['DATE'], infer_datetime_format=True)
data = data[data['DATE'].dt.year == 2017)]

Alternatively, you can specify the precise format:

data['DATE'] = pd.to_datetime(data['DATE'], format='%m/%d/%Y')
data = data[data['DATE'].dt.year == 2017)]
jpp
  • 159,742
  • 34
  • 281
  • 339
  • 1
    Thanks so much jpp I didnt know about infer_datetime_format that really did help and speed things up. – BearsBeetBattlestar Apr 29 '18 at 01:31
  • @laughingllama42, or just specify the date format explicitly when calling `pd.to_datetime(data.DATE, format='%m/%d/%Y')`. It took 3 seconds to convert with the format specified, but 144 seconds without it specified. – ALollz Apr 29 '18 at 01:34
  • 1
    @ALollz, Thanks - also added a link to the `datetime` abbreviation page. – jpp Apr 29 '18 at 01:37
  • @jpp is there any way to modify this to also add in the time column? – BearsBeetBattlestar Apr 29 '18 at 01:59
  • @laughingllama42, Yes. There are many different formats you can incorporate, [this link](http://strftime.org/) is a complete guide. See [here](https://stackoverflow.com/questions/17978092/combine-date-and-time-columns-using-python-pandas) for how to combine 2 columns. – jpp Apr 29 '18 at 02:00