
I have a data frame with intervals like

start end
1     10
3     7
8     10

and I need to find the number of intersections with another data frame of values

value
2
5
9

The result should be

1
2
2

The second part of the question is trickier. My data frame with intervals also contains a type column:

start end type
1     10  1
3     7   1
8     10  2

And I need to know how many unique types of intervals each value intersects. The result should be:

1
1
2 

I guess the first part can be done with numpy.searchsorted, but what about the second one?
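For reference, the first part can indeed be done with numpy.searchsorted: for each value, count how many intervals have started minus how many have already ended. This is a sketch assuming closed intervals [start, end], using the example data above:

```python
import numpy as np

starts = np.sort(np.array([1, 3, 8]))   # sorted interval starts
ends = np.sort(np.array([10, 7, 10]))   # sorted interval ends
values = np.array([2, 5, 9])

# intervals containing v = (# of starts <= v) - (# of ends < v)
counts = (np.searchsorted(starts, values, side='right')
          - np.searchsorted(ends, values, side='left'))
print(counts)  # [1 2 2]
```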

sh1ng
  • In the second row the type is 1. If the value in the other data frame is 5, then both [1,10] and [3,7] are intersected. So shouldn't the result (i.e. the number of unique intervals intersected) be `[1,2,2]` rather than `[1,1,2]`? – unutbu Feb 23 '16 at 11:24
  • If you did the first part and showed us your code, it would be easier to help you with the second part. – IanS Feb 23 '16 at 12:02
  • @unutbu both [1,10] and [3,7] are of the same type (type 1), so there is only one type of interval that 5 intersects. It's not unique intervals but unique types. – IanS Feb 23 '16 at 12:05

2 Answers


Let's call your first dataframe df. For a given value, intersecting intervals can be found as such:

mask = (df['start'] <= value) & (df['end'] >= value)

The following will return the number of intersecting intervals:

mask.sum()

The following will return the number of intersecting types:

len(df['type'][mask].unique())

Now you can apply a lambda function to the values series:

values = pd.Series([2, 5, 9], name='value')
values.apply(lambda value: len(df['type'][(df['start'] <= value) & (df['end'] >= value)].unique()))
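Run against the example data from the question, this yields the expected result (a self-contained rerun of the snippet above):

```python
import pandas as pd

df = pd.DataFrame({'start': [1, 3, 8], 'end': [10, 7, 10], 'type': [1, 1, 2]})
values = pd.Series([2, 5, 9], name='value')

# For each value, count the distinct types among intersecting intervals
result = values.apply(
    lambda value: len(df['type'][(df['start'] <= value) & (df['end'] >= value)].unique()))
print(result.tolist())  # [1, 1, 2]
```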
IanS
  • I'm concerned about the efficiency of your code. Can it be implemented in a 100% broadcasted way? – sh1ng Feb 23 '16 at 13:19
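A fully broadcasted variant is possible with NumPy (a sketch, not part of the original answer): compare every value against every interval at once, then reduce per type through a one-hot matrix.

```python
import numpy as np

starts = np.array([1, 3, 8])
ends = np.array([10, 7, 10])
types = np.array([1, 1, 2])
values = np.array([2, 5, 9])

# (n_values, n_intervals) containment matrix
mask = (starts[None, :] <= values[:, None]) & (ends[None, :] >= values[:, None])

# Part 1: number of intersecting intervals per value
counts = mask.sum(axis=1)  # [1 2 2]

# Part 2: number of distinct types intersected per value
uniq = np.unique(types)
onehot = (types[:, None] == uniq[None, :])  # (n_intervals, n_types)
type_counts = ((mask.astype(int) @ onehot.astype(int)) > 0).sum(axis=1)  # [1 1 2]
```

This avoids the Python-level loop hidden in `apply`, at the cost of materializing an n_values × n_intervals matrix.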

DSM shows a great way to deal with intervals using Pandas. Following that pattern, we can combine the start and end values into a single column of idxs with a second column (change) which equals 1 when the idx corresponds to a start, and -1 when the idx corresponds to an end.

df = pd.DataFrame(
    {'end': [10, 7, 10], 'start': [1, 3, 8], 'type': [1, 1, 2]})
event = pd.melt(df, id_vars=['type'], var_name='change', value_name='idx')
event['change'] = event['change'].map({'start':1, 'end':-1})
event = event.sort_values(by=['idx'])
#    type  change  idx
# 3     1       1    1
# 4     1       1    3
# 1     1      -1    7
# 5     2       1    8
# 0     1      -1   10
# 2     2      -1   10

Now, since we wish to keep track of the type of intervals, we can use event.pivot to place each type in its own column. Taking the cumsum counts the number of intervals covering idx:

event = event.pivot(index='idx', columns='type', values='change').fillna(0).cumsum(axis=0)
# type  1  2
# idx       
# 1     1  0
# 3     2  0
# 7     1  0
# 8     1  1
# 10    0  0

For each type, we only care about which values are covered, not how many times they are covered. So let's compute event > 0 to find the values covered:

event = event > 0
# type      1      2
# idx               
# 1      True  False
# 3      True  False
# 7      True  False
# 8      True   True
# 10    False  False

Now we can use searchsorted to find the desired result:

other = pd.DataFrame({'value': [2, 5, 9]})
idx = event.index.searchsorted(other['value'])-1
other['result'] = event.iloc[idx].sum(axis=1).values

Putting it all together:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'end': [10, 7, 10], 'start': [1, 3, 8], 'type': [1, 1, 2]})

event = pd.melt(df, id_vars=['type'], var_name='change', value_name='idx')
event['change'] = event['change'].map({'start':1, 'end':-1})
event = event.sort_values(by=['idx'])
event = event.pivot(index='idx', columns='type', values='change').fillna(0).cumsum(axis=0)
event = event > 0
other = pd.DataFrame({'value': [2, 5, 9]})
idx = event.index.searchsorted(other['value'])-1
other['result'] = event.iloc[idx].sum(axis=1).values
print(other)

yields

   value  result
0      2       1
1      5       1
2      9       2

To check the correctness of the calculation, let's look at

other = pd.DataFrame({'value': np.arange(13)})

Then

idx = event.index.searchsorted(other['value'])-1
other['result'] = event.iloc[idx].sum(axis=1).values
print(other)

yields

    value  result
0       0       0
1       1       0   <-- The half-open interval (1, 10] does not include 1
2       2       1
3       3       1
4       4       1
5       5       1
6       6       1
7       7       1
8       8       1   <-- The half-open interval (8, 10] does not include 8
9       9       2
10     10       2
11     11       0
12     12       0

Notice that this method of computation treats each interval as half-open, i.e. (start, end]. If you wish to use half-open intervals of the form [start, end) instead, use

idx = event.index.searchsorted(other['value'], side='right')-1
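Rebuilding the event table and rerunning with side='right' shows the boundary behavior flip (a self-contained sketch reproducing the steps above):

```python
import pandas as pd

df = pd.DataFrame(
    {'end': [10, 7, 10], 'start': [1, 3, 8], 'type': [1, 1, 2]})
event = pd.melt(df, id_vars=['type'], var_name='change', value_name='idx')
event['change'] = event['change'].map({'start': 1, 'end': -1})
event = event.sort_values(by=['idx'])
event = event.pivot(index='idx', columns='type', values='change').fillna(0).cumsum(axis=0)
event = event > 0

other = pd.DataFrame({'value': [1, 8, 10]})
idx = event.index.searchsorted(other['value'], side='right') - 1
other['result'] = event.iloc[idx].sum(axis=1).values
print(other)
#    value  result
# 0      1       1   <-- 1 is now included in [1, 10)
# 1      8       2   <-- 8 is now included in [8, 10)
# 2     10       0   <-- 10 is now excluded
```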
unutbu