I have a tight loop which, among other things, checks whether a given date (in the form of a pandas.Timestamp) is contained in a given unique pandas.DatetimeIndex (the application is checking whether a date is a custom business day).
As a minimal example, consider this bit:
import pandas as pd
dates = pd.date_range("2020", "2021")
index = dates.to_series().sample(frac=0.7).sort_index().index
for date in dates:
    if date in index:
        pass  # Do stuff...
(Note that simply iterating over index is not an option in the full application.)
To my surprise, I found that the date in index bit takes up a significant part of the total runtime. Profiling furthermore shows that pandas' membership check does a lot more than just a hash lookup, which is further confirmed by a small experiment comparing DatetimeIndex vs. a plain Python set:
%timeit [date in index for date in dates]
# 3.28 ms ± 81.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
vs
index_set = set(index)
%timeit [date in index_set for date in dates]
# 341 µs ± 3.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Note that the difference is almost 10x! Why is there such a difference, and can I do anything to make it faster?
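In case it helps to reproduce this outside IPython, here is a minimal standalone sketch of the same comparison using the standard timeit module. The random_state=0 seed, the helper names check_index and check_set, and the repeat count are my own choices for illustration, not part of the real application, and absolute timings will depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

# Same setup as above; random_state is only added so the sample is reproducible.
dates = pd.date_range("2020", "2021")
index = dates.to_series().sample(frac=0.7, random_state=0).sort_index().index
index_set = set(index)


def check_index():
    """Membership test against the DatetimeIndex, one date at a time."""
    return [date in index for date in dates]


def check_set():
    """Membership test against a plain Python set of Timestamps."""
    return [date in index_set for date in dates]


# Both approaches must agree before the timings are worth comparing.
assert check_index() == check_set()

repeats = 20  # arbitrary; raise it for more stable numbers
t_index = timeit.timeit(check_index, number=repeats) / repeats
t_set = timeit.timeit(check_set, number=repeats) / repeats

print(f"DatetimeIndex: {t_index * 1e6:.0f} µs per pass")
print(f"set:           {t_set * 1e6:.0f} µs per pass")
print(f"ratio:         {t_index / t_set:.1f}x")
```

The assertion is there to make sure the two lookups return the same answers, so the only thing being compared is the cost of the membership check itself.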