
I have a tight loop which, among other things, checks whether a given date (a pandas.Timestamp) is contained in a given unique pandas.DatetimeIndex. The application is checking whether a date is a custom business day.

As a minimal example, consider this bit:

import pandas as pd

dates = pd.date_range("2020", "2021")
index = dates.to_series().sample(frac=0.7).sort_index().index

for date in dates:
    if date in index:
        # Do stuff...
        pass

(Note that simply iterating over index is not an option in the full application.)

To my surprise, I found that the membership check (date in index) takes up a significant part of the total runtime. Profiling shows that pandas' membership check does a lot more than just a hash lookup, which is further confirmed by a small experiment comparing a DatetimeIndex with a plain Python set:

%timeit [date in index for date in dates]
# 3.28 ms ± 81.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

vs

index_set = set(index)
%timeit [date in index_set for date in dates]
# 341 µs ± 3.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The difference is almost 10x! Why is the DatetimeIndex check so much slower, and can I do anything to make it faster?
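
To reproduce the comparison outside IPython, here is a minimal sketch using the standard timeit module instead of %timeit. The random_state seed and the repeat count n are arbitrary choices added for repeatability, not part of the snippets above, and the absolute timings will differ from machine to machine. The script also checks that both containers give the same answers, so only the lookup cost differs.

```python
import timeit

import pandas as pd

dates = pd.date_range("2020", "2021")
# random_state is an assumption added here only to make the sample repeatable.
index = dates.to_series().sample(frac=0.7, random_state=0).sort_index().index
index_set = set(index)

# Both membership checks should agree for every date.
via_index = [date in index for date in dates]
via_set = [date in index_set for date in dates]
assert via_index == via_set

# Average time per full pass over dates; n is an arbitrary repeat count.
n = 20
t_index = timeit.timeit(lambda: [date in index for date in dates], number=n) / n
t_set = timeit.timeit(lambda: [date in index_set for date in dates], number=n) / n

print(f"DatetimeIndex: {t_index * 1e3:.3f} ms per pass")
print(f"set:           {t_set * 1e3:.3f} ms per pass")
```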

NinjaTuna