
I am parsing CSV files with a significant number of rows, which contain dates I would like to convert.

I first read the CSV file, then use pd.to_datetime() to transform the strings into Timestamps.

Here is what the strings look like, and the format I tried to use:

In [8]: ts_temp
Out[8]: 
0     Sun Dec 22 2019 07:40:00 GMT+0100
1     Sun Dec 22 2019 07:45:00 GMT+0100
2     Sun Dec 22 2019 07:50:00 GMT+0100

date_format = "%a %b %d %Y %H:%M:%S %Z"
index = pd.to_datetime(ts_temp, utc=True, format=date_format)

Unfortunately, I then get this error message.

ValueError: unconverted data remains: 100

I can confirm that using infer_datetime_format=True instead works, with the timezone read correctly, but it seems to take a long time.
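For reference, here is a minimal, self-contained version of what I am running; the Series is rebuilt from the example strings above:

import pandas as pd

# Input Series rebuilt from the example strings shown above.
ts_temp = pd.Series([
    "Sun Dec 22 2019 07:40:00 GMT+0100",
    "Sun Dec 22 2019 07:45:00 GMT+0100",
    "Sun Dec 22 2019 07:50:00 GMT+0100",
])

# Format inference works and reads the timezone correctly,
# but the inference step adds overhead on large inputs.
index = pd.to_datetime(ts_temp, utc=True, infer_datetime_format=True)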

I would like to see whether I can improve the running time by specifying the format directly.

Thanks for any help!

pierre_j

2 Answers


You can let pandas do the heavy lifting on the time format by removing the format option:

pd.to_datetime(ts_temp, utc=True)

Output:

0   2019-12-22 08:40:00+00:00
1   2019-12-22 08:45:00+00:00
2   2019-12-22 08:50:00+00:00
Name: 1, dtype: datetime64[ns, UTC]
Quang Hoang
  • Hi @Quang Hoang, thanks for your reply. As I said, I am looking for a performance improvement, which I understand I can get by specifying the format directly to pandas (maybe I am wrong, but I have not been able to check, as I cannot specify the format correctly). Thanks for your help nonetheless. – pierre_j May 08 '20 at 04:54

OK, I finally found it. The correct format is:

date_format = "%a %b %d %Y %H:%M:%S GMT%z"

The %Z directive does not consume the numeric +0100 offset (hence the "unconverted data remains" error); matching GMT as a literal and parsing the offset with %z does the job.

Parsing is about 40% faster with it than with the 'classical' inference.
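As a rough check, here is a minimal timing sketch; the sample Series is built by repeating one of the example strings, and its size (100,000) is arbitrary:

import time

import pandas as pd

# Arbitrary sample: one example string from the question, repeated.
ts_temp = pd.Series(["Sun Dec 22 2019 07:40:00 GMT+0100"] * 100_000)
date_format = "%a %b %d %Y %H:%M:%S GMT%z"

# Explicit format: GMT is matched literally, %z parses the +0100 offset.
start = time.perf_counter()
pd.to_datetime(ts_temp, utc=True, format=date_format)
print(f"explicit format: {time.perf_counter() - start:.3f}s")

# Inferred format (infer_datetime_format is deprecated in pandas >= 2.0,
# where format inference became the default behaviour).
start = time.perf_counter()
pd.to_datetime(ts_temp, utc=True, infer_datetime_format=True)
print(f"inferred format: {time.perf_counter() - start:.3f}s")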

pierre_j