DSM shows a great way to deal with intervals using Pandas. Following that pattern, we can combine the start
and end
values into a single column of idx
s with a second column (change
) which equals 1 when the idx
corresponds to a start
, and -1 when the idx
corresponds to an end
.
df = pd.DataFrame(
{'end': [10, 7, 10], 'start': [1, 3, 8], 'type': [1, 1, 2]})
event = pd.melt(df, id_vars=['type'], var_name='change', value_name='idx')
event['change'] = event['change'].map({'start':1, 'end':-1})
event = event.sort_values(by=['idx'])
# type change idx
# 3 1 1 1
# 4 1 1 3
# 1 1 -1 7
# 5 2 1 8
# 0 1 -1 10
# 2 2 -1 10
Now, since we wish to keep track of the type
of intervals, we can use event.pivot
to place each type in its own column. Taking the cumsum
counts the number of intervals covering idx
:
event = event.pivot(index='idx', columns='type', values='change').fillna(0).cumsum(axis=0)
# type 1 2
# idx
# 1 1 0
# 3 2 0
# 7 1 0
# 8 1 1
# 10 0 0
For each type
, we only care about the values which are covered, not how many times it is covered. So let's compute event > 0
to find the values covered:
event = event > 0
# type 1 2
# idx
# 1 True False
# 3 True False
# 7 True False
# 8 True True
# 10 False False
Now we can use searchsorted
to find the desired result:
other = pd.DataFrame({'value': [2, 5, 9]})
idx = event.index.searchsorted(other['value'])-1
other['result'] = event.iloc[idx].sum(axis=1).values
Putting it all together:
import numpy as np
import pandas as pd
df = pd.DataFrame(
{'end': [10, 7, 10], 'start': [1, 3, 8], 'type': [1, 1, 2]})
event = pd.melt(df, id_vars=['type'], var_name='change', value_name='idx')
event['change'] = event['change'].map({'start':1, 'end':-1})
event = event.sort_values(by=['idx'])
event = event.pivot(index='idx', columns='type', values='change').fillna(0).cumsum(axis=0)
event = event > 0
other = pd.DataFrame({'value': [2, 5, 9]})
idx = event.index.searchsorted(other['value'])-1
other['result'] = event.iloc[idx].sum(axis=1).values
print(other)
yields
value result
0 2 1
1 5 1
2 9 2
To check the correctness of the calculation, let's look at
other = pd.DataFrame({'value': np.arange(13)})
Then
idx = event.index.searchsorted(other['value'])-1
other['result'] = event.iloc[idx].sum(axis=1).values
print(other)
yields
value result
0 0 0
1 1 0 <-- The half-open interval (1, 10] does not include 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1 <-- The half-open interval (8, 10] does not include 8
9 9 2
10 10 2
11 11 0
12 12 0
Notice that this method of computation treats the (start, end]
interval as half-open. If you wish to instead use the half-open intervals [start, end)
, use
idx = event.index.searchsorted(other['value'], side='right')-1