0

Converting a large datetime64[D] Series (i.e. 900k rows of a DataFrame column) is taking too long. How can I speed it up?

import pandas as pd

df = pd.DataFrame(['2021-10-01']*900000, columns=['date'])   # 0.025286900 seconds
df = df.assign(date=df['date'].astype('datetime64[D]'))      # 0.105065900 

# Why is converting from datetime to str so slow?
df.assign(date=df['date'].dt.strftime('%Y-%m-%d'))           # 5.600835100 seconds. 

txt = str(df)                                                # 0.006202600
# Converting the entire DataFrame to a str is much faster 
# than converting a column directly, despite a similar display format!

There is a related question, which asks how to convert quickly from str to datetime. But my bottleneck is (surprisingly) the inverse; converting from datetime[D] to str is far too slow.

ChaimG
  • 7,024
  • 4
  • 38
  • 46
  • In the first three, you're performing an assignment, which is costly. In `str(df)` you are not assigning the results to a location in memory -- therefore it's an unfair comparison. Unless I'm misunderstanding...? – ddejohn Mar 16 '22 at 00:14
  • @ddejohn: I changed `str(df)` to `txt = str(df)`. It's still super fast, while `strftime` is very slow. – ChaimG Mar 16 '22 at 00:21
  • Does this answer your question? [Is there a faster date conversion than pd.to\_datetime?](https://stackoverflow.com/questions/42771971/is-there-a-faster-date-conversion-than-pd-to-datetime) – Stephan Mar 16 '22 at 00:30
  • 1
    @Stephan: That question is the inverse; converting `str` to `datetime`. That is working quite fast for for me. It's the conversion back from `datetime` to `str` that is surprisingly taking a long time. – ChaimG Mar 16 '22 at 01:38
  • 1
    `str(df)` just converts the first few and the last few lines of the column. Maybe that's why it's so fast. – ChaimG Mar 16 '22 at 14:51

1 Answers1

1

Here is a much faster solution:

np.datetime_as_string(df['date'], unit='D')   # 0.375799800 seconds.

It's still taking longer than it should, (how does it make sense that converting from datetime is slower than converting from str), but it is much much faster.

ChaimG
  • 7,024
  • 4
  • 38
  • 46
  • 1
    It turns out dates are not as simple to convert as it seems. One has to consider time zones, unicode encoding, the automatic adaptation of the size size based on all the dates. The Numpy code is available [here](https://github.com/numpy/numpy/blob/3544fae2859d8ecac378b9ddc8f909c69455a831/numpy/core/src/multiarray/datetime_strings.c#L1334) and it is quite complex. A simple profiling analysis shows that 40-50% of the time is spent in the libc implementation of Microsoft on Windows and not in the Numpy code... – Jérôme Richard Mar 20 '22 at 22:22
  • @Jérôme Richard: Nice work, thanks! Is there a way to bypass all that overhead? – ChaimG Mar 21 '22 at 01:59
  • Unfortunately, I did not found so far. – Jérôme Richard Apr 20 '22 at 09:25