0

I have a pandas dataframe INT with two integer columns, START and END, representing intervals [START, END]. I need to check if an integer POS falls in one of these intervals, that is, if there exists a row where START <= POS <= END. I need to do that for hundreds of thousands POS and I have thousands of intervals. Everything is sorted, both the intervals and the POS values.

I have what I think is an efficient solution, checking the POS values in order and keeping track of the last closest intervals, so that I can start hopefully close to the interval I want (if it exists) and I only have to go forward in the interval table to check if there is one:

max_inter = max(intervals.END - intervals.START)
last_index = 0
def find(POS):
  global last_index, max_inter, intervals
  i = last_index
  while i > 0 and intervals.START.iloc[i] > POS - max_inter:
    i -= 1
  found = False
  while i < len(intervals) - 1 and intervals.START.iloc[i]) <= POS:
    if intervals.START.iloc[i] <= POS and POS <= intervals.END.iloc[i]:
      found = True
      break
    i += 1
  last_index = i
  return found

However this is slower than what I would like since it is in pure python, is there an efficient way to do that in pandas or numpy?

I already tried something like

any((intervals.START <= POS) & (POS <= intervals.END))

but this is much slower than my solution. Any suggestion? Am I missing a library function?

Thanks

Edit: Maybe I should mention that I have a (sorted) Series of POS values that I need to check, I am currently using positions.map(find) to produce a boolean series, maybe there are better ways of doing this. Also I have to do this for thousands of positions and intervals, that is why I am interested in speed.

The Unfun Cat
  • 29,987
  • 31
  • 114
  • 156
Giovo
  • 43
  • 5
  • doesn't `intervals[(intervals['START'] <= POS) & (POS <= intervals['END'])]` work? – EdChum Oct 29 '14 at 08:37
  • @EdChum it works but it is much slower, even though it uses library functions, since it needs to check the entire intervals table everytime. I need something more efficient. – Giovo Oct 29 '14 at 09:09
  • Another thought would be to order START and END like you have already and perform a `np.searchsorted` on START and END passing in POS and then compare if the index position is within this range if you understand what I mean – EdChum Oct 29 '14 at 09:13

2 Answers2

0

You may use boolean indexing, like the answer from this SO answer

Personally, I would use eval which is very efficient for large array, something like this:

import pandas as pd

df = pd.DataFrame([[4,9],[2,5],[3,6],[1,4]], columns=['start','end'])
df
   start  end
0      4    9
1      2    5
2      3    6
3      1    4

pos = 3

df[df.eval('(start <= {}) & ({} <= end)'.format(pos,pos))]
   start  end
1      2    5
2      3    6
3      1    4
Community
  • 1
  • 1
Anzel
  • 19,825
  • 5
  • 51
  • 52
  • i tried your suggestion on a small subset of my data, it takes 13 minutes, while my solution takes 26 seconds. I guess eval _is_ very efficient, but it wastes time by checking the whole df everytime, while I check only a handfuls of intervals at a time. – Giovo Oct 29 '14 at 11:11
  • @Giovo, are you satisfied with the efficiency with the suggestion? – Anzel Oct 29 '14 at 11:13
  • @Giovo, hope you will find a better solution, there must be someone knowledgeable to come up with a solution which has both efficiency and performance ;) – Anzel Oct 29 '14 at 11:22
  • @Giovo, have you considered enhancing the performance with **cython**? – Anzel Oct 29 '14 at 11:24
0

While this isn't pure pandas, it is super fast. The NCLS is used to find ~50 million overlaps in < 5 seconds.

Install:

# pip install ncls
# or
# conda install -c bioconda ncls

Setup:

import numpy as np
np.random.seed(0)
import pandas as pd

size = int(1e6)

dtype = np.int32
start = np.random.randint(int(1e7), size=size, dtype=dtype)
end = start + np.random.randint(int(1e3), size=size, dtype=dtype)

start2 = np.random.randint(int(1e7), size=size, dtype=dtype)
end2 = start2 + 1

intervals = pd.DataFrame({"Start": start, "End": end})
#           Start      End
# 0       8325804  8326332
# 1       1484405  1485343
# 2       2215104  2215531
# 3       5157699  5157834
# 4       8222403  8222497
# ...         ...      ...
# 999995  2981746  2982673
# 999996  1453668  1454251
# 999997  3325111  3325135
# 999998  4311711  4312465
# 999999  8089671  8090277
# 
# [1000000 rows x 2 columns]
points = pd.DataFrame({"Start": start2, "End": end2})
#           Start      End
# 0       1714420  1714421
# 1        980607   980608
# 2       5566444  5566445
# 3       2788107  2788108
# 4       6145575  6145576
# ...         ...      ...
# 999995  1824809  1824810
# 999996  6135851  6135852
# 999997  5190341  5190342
# 999998  7403307  7403308
# 999999  9732498  9732499
# 
# [1000000 rows x 2 columns]

Execution:

from ncls import NCLS
n = NCLS(intervals.Start.values, intervals.End.values, intervals.index.values)
# Wall time: 421 ms

p_ix, i_ix = n.all_overlaps_both(points.Start.values, points.End.values, points.index.values)
# Wall time: 4.4s

len(i_ix) / 1e6
# 49.895545

i = intervals.reindex(i_ix).reset_index(drop=True)
p = points.reindex(p_ix).reset_index(drop=True)
p.columns = ["PStart", "PEnd"]
result = pd.concat([i, p], axis=1)
print(result)

#             Start      End   PStart     PEnd
# 0         1713535  1714442  1714420  1714421
# 1         1713560  1714479  1714420  1714421
# 2         1713670  1714590  1714420  1714421
# 3         1713677  1714666  1714420  1714421
# 4         1713694  1714627  1714420  1714421
# ...           ...      ...      ...      ...
# 49895540  9732449  9732910  9732498  9732499
# 49895541  9732491  9733159  9732498  9732499
# 49895542  9732492  9732621  9732498  9732499
# 49895543  9732496  9732653  9732498  9732499
# 49895544  9732496  9732512  9732498  9732499
# 
# [49895545 rows x 4 columns]
The Unfun Cat
  • 29,987
  • 31
  • 114
  • 156