I have a pandas DataFrame intervals with two integer columns, START and END, representing closed intervals [START, END]. I need to check whether an integer POS falls in one of these intervals, that is, whether there exists a row with START <= POS <= END. I need to do this for hundreds of thousands of POS values against thousands of intervals. Everything is sorted: both the intervals (by START) and the POS values.
I have what I think is an efficient solution: I check the POS values in order and keep track of the index of the last candidate interval, so that each lookup hopefully starts close to the interval I want (if it exists) and mostly only has to move forward in the interval table:
```python
max_inter = (intervals.END - intervals.START).max()
last_index = 0

def find(POS):
    global last_index
    i = last_index
    # Walk backwards: any interval containing POS must start after POS - max_inter
    while i > 0 and intervals.START.iloc[i] > POS - max_inter:
        i -= 1
    found = False
    # Walk forwards over the candidate intervals that start at or before POS
    while i < len(intervals) and intervals.START.iloc[i] <= POS:
        if intervals.START.iloc[i] <= POS <= intervals.END.iloc[i]:
            found = True
            break
        i += 1
    last_index = min(i, len(intervals) - 1)  # keep the index in bounds for the next call
    return found
```
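For reference, here is a minimal, self-contained run of the function above on toy data (the interval endpoints are made up for illustration):

```python
import pandas as pd

intervals = pd.DataFrame({"START": [10, 50, 100], "END": [20, 60, 110]})
max_inter = (intervals.END - intervals.START).max()
last_index = 0

def find(POS):
    global last_index
    i = last_index
    # Walk backwards: any interval containing POS must start after POS - max_inter
    while i > 0 and intervals.START.iloc[i] > POS - max_inter:
        i -= 1
    found = False
    # Walk forwards over the candidate intervals that start at or before POS
    while i < len(intervals) and intervals.START.iloc[i] <= POS:
        if intervals.START.iloc[i] <= POS <= intervals.END.iloc[i]:
            found = True
            break
        i += 1
    last_index = min(i, len(intervals) - 1)  # keep the index in bounds for the next call
    return found

print([find(p) for p in [5, 15, 55, 99, 105, 200]])
# [False, True, True, False, True, False]
```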
However, this is slower than I would like because it runs in pure Python. Is there an efficient way to do this in pandas or NumPy?
I already tried something like

```python
any((intervals.START <= POS) & (POS <= intervals.END))
```

but this is much slower than my solution, presumably because it scans every interval for every POS. Any suggestions? Am I missing a library function?
Thanks
Edit: Maybe I should mention that I have a (sorted) Series of POS values to check; I am currently using positions.map(find) to produce a boolean Series, but maybe there are better ways of doing this. As said above, I have hundreds of thousands of positions and thousands of intervals, which is why I am interested in speed.
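Edit 2: One fully vectorized direction I have been sketching (treat it as a sketch, not benchmarked on my real data) combines np.searchsorted on the sorted START column with a running maximum of END; the running maximum is meant to handle intervals that overlap or nest, since POS is covered exactly when some interval starting at or before POS ends at or after it. The toy data below is made up for illustration:

```python
import numpy as np
import pandas as pd

intervals = pd.DataFrame({"START": [10, 50, 100], "END": [20, 60, 110]})  # toy intervals
positions = pd.Series([5, 15, 55, 99, 105, 200])                          # toy sorted POS values

starts = intervals.START.to_numpy()
pos = positions.to_numpy()

# Running maximum of END: POS is covered iff some interval with START <= POS
# has END >= POS, i.e. iff the cumulative max of END up to that point reaches POS.
ends_cummax = np.maximum.accumulate(intervals.END.to_numpy())

# Index of the last interval whose START <= POS (-1 if no interval starts before POS)
idx = np.searchsorted(starts, pos, side="right") - 1
covered = (idx >= 0) & (pos <= ends_cummax[np.clip(idx, 0, None)])

print(pd.Series(covered, index=positions.index))
```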