This solution uses piso
(pandas interval set operations) and will run fast.
setup
Create data and convert to pandas.Timedelta
df = pd.DataFrame(
{
"start": ["00:00", "00:10", "00:00", "00:00", "00:10", "00:15"],
"end": ["00:10", "00:20", "00:20", "00:10", "00:15", "00:20"],
"name": ["a", "b", "c", "d", "e", "a"],
}
)
df[["start", "end"]] = ("00:" + df[["start", "end"]].astype(str)).apply(pd.to_timedelta)
create the sample times (a pandas.TimedeltaIndex
of seconds):
sample_times = pd.timedelta_range(df["start"].min(), df["end"].max(), freq="s")
solution
For each possible value of "name" create a pandas.IntervalIndex
which has the intervals defined by start and stop columns:
ii_series = df.groupby("name").apply(
lambda d: pd.IntervalIndex.from_arrays(d["start"], d["end"], closed="left")
)
ii_series
looks like this:
name
a IntervalIndex([[0 days 00:00:00, 0 days 00:10:...
b IntervalIndex([[0 days 00:10:00, 0 days 00:20:...
c IntervalIndex([[0 days 00:00:00, 0 days 00:20:...
d IntervalIndex([[0 days 00:00:00, 0 days 00:10:...
e IntervalIndex([[0 days 00:10:00, 0 days 00:15:...
dtype: object
Then to each of these interval index we'll apply the piso.contains
function, which can be used to test whether a set of points is contained in an interval
contained = ii_series.apply(piso.contains,x=sample_times, result="points")
contained
will be a dataframe indexed by the names, and whose columns are the sample times. The transpose of this, looks like:
a b c d e
0 days 00:00:00 True False True True False
0 days 00:00:01 True False True True False
0 days 00:00:02 True False True True False
0 days 00:00:03 True False True True False
0 days 00:00:04 True False True True False
... ... ... ... ... ...
0 days 00:19:56 True True True False False
0 days 00:19:57 True True True False False
0 days 00:19:58 True True True False False
0 days 00:19:59 True True True False False
0 days 00:20:00 False False False False False
This format of data may be easier to work with, depending on the application, but if you want to have it in the format stated in the question then you can create a series of lists, indexed by each second:
series_of_lists = (
contained.transpose()
.melt(ignore_index=False)
.query("value == True")
.reset_index()
.groupby("index")["name"]
.apply(pd.Series.to_list)
)
Then convert to dataframe:
pd.DataFrame(series_of_lists.to_list(), index=series_of_lists.index)
which will look like this:
0 1 2
index
0 days 00:00:00 a c d
0 days 00:00:01 a c d
0 days 00:00:02 a c d
0 days 00:00:03 a c d
0 days 00:00:04 a c d
... .. .. ..
0 days 00:19:55 a b c
0 days 00:19:56 a b c
0 days 00:19:57 a b c
0 days 00:19:58 a b c
0 days 00:19:59 a b c
Note: I am the creator of piso, feel free to reach out if you have any questions.